Machine learning systems for automated pharmaceutical molecule identification

ABSTRACT

Aspects of the present disclosure provide systems, methods, and computer-readable storage media that leverage artificial intelligence and machine learning to identify molecules or compounds for use in pharmaceuticals. In aspects, one or more machine learning (ML) models may be trained to identify molecules based on pharmaceutical data that indicates properties of previously-identified pharmaceutical molecules, such as physiochemical structure, side effects, toxicity, solubility, and the like. The ML models may include generative models, such as generative adversarial networks or variational autoencoders. The trained ML models may be used to identify new (e.g., previously-unidentified) molecules, or the trained ML models may be provided to client devices for use in molecule identification (e.g., drug discovery).

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application claims priority to Indian Provisional Patent Application No. 202041041670, filed on Sep. 25, 2020, entitled “DRUG DISCOVERY AND SEARCH USING MACHINE LEARNING,” and the present application is related to co-pending U.S. patent application Ser. No. ______ (Atty. Dkt. No. ACNT.P0028US), entitled “MACHINE LEARNING SYSTEMS FOR AUTOMATED PHARMACEUTICAL MOLECULE SCREENING AND SCORING,” filed Jan. 21, 2021, the contents of each of which are expressly incorporated herein in their entirety.

TECHNICAL FIELD

The present disclosure relates to systems and methods for leveraging machine learning and artificial intelligence to automatically identify molecules and compounds for use in pharmaceutical products.

BACKGROUND

Pharmaceuticals are one of the largest and most profitable industries in the world, as illustrated by the worldwide pharmaceutical market being worth approximately 1.3 trillion dollars in 2019 according to some estimates. In addition to researching and manufacturing new drugs (e.g., pharmaceuticals) to cure or treat new diseases or conditions, pharmaceutical companies spend significant resources researching and “discovering” (e.g., identifying) new drugs for known diseases that have increased efficacy, fewer side effects, and fewer harmful drug interactions. For example, a pharmaceutical company may try to optimize and improve an already manufactured drug for a specific disease with the goal of improving efficacy and reducing side effects.

Designing (e.g., discovering or identifying) new drugs is typically a manually-intensive process. To design a new drug for a particular disease, a human drug expert (e.g., a chemist, biochemist, researcher, etc.) may consider a known molecule or compound used in a currently-available drug for treating the particular disease, and the human drug expert may identify multiple candidate molecules based on the known molecule. For example, the human drug expert may decide to add an additional element to, remove an element from, or modify the physiochemical structure of, the known molecule or compound based on their experience and knowledge to design candidate molecules. The candidate molecules may be visually screened by the human drug expert, and a selected subset of candidate molecules that pass the visual screening may be further screened using lab experiments or other testing. Thus, the drug design (also referred to as drug discovery or drug identification) process is limited by the knowledge and experience of the human drug expert. Additionally, the human drug expert may focus their attention on the particular disease to be treated, which may result in the human drug expert failing to explore or consider molecules or compounds that are not widely known as useful in treating the particular disease, which may limit the search space for the candidate molecules.

Conventional drug discovery may be a long and expensive process. For example, each new drug, from discovery to launch, typically takes approximately twelve to fifteen years and costs approximately 1.2 billion dollars. Additionally, the drug discovery process includes many different steps such as discovery, optimization, preclinical trials, phased clinical trials, registration, and eventual launch. During many or all of these steps, a significant portion of the candidate molecules or compounds are filtered out or otherwise rejected. For example, by some estimations, only approximately 1.8% of newly identified molecules or compounds are successfully tested and implemented into pharmaceuticals released to consumers. Thus, the typical drug discovery process is neither efficient nor cost-effective.

SUMMARY

Aspects of the present disclosure provide systems, methods, and computer-readable storage media for automated identification of molecules or compounds using machine learning for use in pharmaceutical products such as drugs, medicine, remedies, and the like. The molecules may be identified by a drug discovery platform with minimal user input as compared to other drug discovery systems. To facilitate automated identification of “new” molecules (e.g., previously unidentified molecules), the drug discovery platform may train and leverage artificial intelligence and machine learning based on pharmaceutical data acquired from a variety of sources, such as publicly available drug information databases, third party drug information databases, proprietary databases, and the like. The pharmaceutical data may include multiple different forms or formats of drug-related data for a large quantity of previously-identified drugs (e.g., previously identified molecules or compounds that make up the drugs). For example, the pharmaceutical data may include physiochemical data that indicates physiochemical properties of the previously-identified molecules, such as the elements included in the molecules, the physiochemical structure of the molecules, the molecular weight of the molecules, the isomerization of the molecules, etc. The pharmaceutical data may also, or in the alternative, include other types of data, such as drug impact data, side effect data, toxicity data, and solubility data, as non-limiting examples. The pharmaceutical data may be processed and transformed into a form that may be used as training data. For example, if the pharmaceutical data includes simplified molecular-input line-entry system (SMILES)-formatted data that represents molecular structure as a string of letters and characters, natural language processing may be performed on the SMILES-formatted data to convert the strings to numerical data for vectorization into training data. Such training data may be used to train the artificial intelligence or machine learning to automatically identify molecules that are distinct from the previously-identified molecules associated with the pharmaceutical data.

In aspects, a computing device (e.g., a server or other device that implements a drug discovery platform) may acquire pharmaceutical data from one or more databases, such as the publicly available ZINC database (“Zinc15” or “Zinc12”), ChEMBL database, PubChem database, and SIDER database (“SIDER Side Effect Resource”), as non-limiting examples. The pharmaceutical data may indicate properties (e.g., physiochemical properties, impact on a human body, side effects, toxicity, solubility, etc.) associated with multiple previously-identified pharmaceutical molecules. The computing device may convert at least a portion of the pharmaceutical data to training data. For example, the computing device may perform natural language processing on text data or SMILES-formatted data to convert the text data or SMILES-formatted data into numerical data. As another example, the computing device may convert categorical values to numerical data, such as binary data or encoded numerical data (e.g., using a one-hot encoding, as a non-limiting example). The converted numerical data may be vectorized or otherwise grouped to generate the training data. In some implementations, the computing device may perform pre-processing, such as filtering, outlier removal, filling in missing entries, dimensionality reduction, or the like on the pharmaceutical data prior to converting the pharmaceutical data to the training data.
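As a non-limiting illustration of converting SMILES-formatted data into numerical data, the following minimal sketch tokenizes each SMILES string character by character and one-hot encodes the result. The function names (build_vocab, one_hot_encode), the character-level tokenization, and the padding scheme are assumptions made for illustration only and are not taken from the disclosure. Character-level tokenization also splits two-letter atom symbols such as “Cl” or “Br”, so a practical tokenizer would likely need to treat those as single tokens.

```python
# Illustrative sketch only: helper names and the character-level scheme are
# assumptions, not the conversion actually used by the disclosed system.

def build_vocab(smiles_list):
    """Map each distinct character to an integer index (0 is reserved for padding)."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def one_hot_encode(smiles, vocab, max_len):
    """Return a max_len x (len(vocab) + 1) matrix of 0/1 values for one SMILES string."""
    width = len(vocab) + 1
    rows = []
    for pos in range(max_len):
        row = [0] * width
        # Positions past the end of the string are encoded as the padding index 0.
        idx = vocab[smiles[pos]] if pos < len(smiles) else 0
        row[idx] = 1
        rows.append(row)
    return rows


smiles_data = ["CCO", "C=O", "CC(=O)O"]  # ethanol, formaldehyde, acetic acid
vocab = build_vocab(smiles_data)
max_len = max(len(s) for s in smiles_data)
training_vectors = [one_hot_encode(s, vocab, max_len) for s in smiles_data]
```

Each molecule becomes a fixed-size matrix, so the matrices may be stacked or flattened into vectors of equal length regardless of the length of the original string.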

After generating the training data, the computing device may train one or more machine learning models based on the training data. Such training may configure the machine learning models to identify new (e.g., previously-unidentified) pharmaceutical molecules. The machine learning models may include generative models that generate new values (e.g., molecules) based on underlying similarities between values indicated by the training data (e.g., the previously-identified molecules). In some implementations, the machine learning models include generative adversarial networks (GANs), variational autoencoders (VAEs), or both, which may be implemented using neural networks or other deep learning structures. In some implementations, the machine learning models are trained to identify molecules having one or more particular properties, such as a particular atomic weight, a particular molecular weight, a particular expected side effect, a particular solubility, or the like. After training the machine learning models, the computing device may use the machine learning models to identify one or more molecules for testing and potential trial. For example, the computing device may initiate display of a graphical user interface (GUI) that includes text, images, graphics, or a combination thereof, that indicate the identified molecules, such as molecule names, names of elements that make up the molecules, two-dimensional graphical representations of the molecules, SMILES representations of the molecules, predicted properties of the molecules, and the like. Additionally or alternatively, the computing device may operate as a training device that trains the machine learning models and provides the machine learning models (or data indicative of the configuration of the machine learning models) to client devices for molecule identification at the client devices.

The present disclosure describes systems that provide improvements compared to other drug discovery systems. For example, the present disclosure describes systems that train machine learning models to automatically identify molecules that have not been previously identified. Using artificial intelligence and machine learning to identify pharmaceutical molecules based on pharmaceutical data associated with large quantities of previously-identified molecules, some of which are not related to the same type of drug, may result in identification of a wider variety of new (e.g., previously-unidentified) pharmaceutical molecules. At least some of these molecules would not be identified by a human drug expert (e.g., a chemist or biochemist) manually designing new molecules. To illustrate, the machine learning models are trained to identify molecules based on underlying similarities between multiple drugs, and many of these underlying similarities may not be apparent to the human drug expert. Thus, the identified molecules may be more similar to successful drugs (even if the drugs are not used to treat the same disease or condition), and therefore are more likely to be useful in producing new drugs than molecules that are manually identified by the human drug expert. Additionally, automated identification of the molecules may be faster than molecule identification by other systems that require substantial user interaction and decision making by the human drug expert. By increasing the likelihood of identifying useful molecules in a shorter period of time, the systems and methods described herein may substantially reduce the costs and shorten the development cycle associated with discovering and launching new drugs (e.g., pharmaceuticals).

In a particular aspect, a method for pharmaceutical molecule identification using machine learning includes obtaining, by one or more processors, pharmaceutical data indicating properties of previously-identified pharmaceutical molecules from one or more databases. The pharmaceutical data includes molecular physiochemical data, drug impact data, side effect data, toxicity data, solubility data, or a combination thereof. The method also includes performing, by the one or more processors, natural language processing (NLP) on at least a portion of the pharmaceutical data to convert the at least a portion of the pharmaceutical data to training data. The training data includes vectorized representations of the properties of the previously-identified pharmaceutical molecules. The method further includes training, by the one or more processors, one or more machine learning (ML) models based on the training data to configure the one or more ML models to identify additional pharmaceutical molecules. The additional pharmaceutical molecules are distinct from the previously-identified pharmaceutical molecules.

In another particular aspect, a system for pharmaceutical molecule identification using machine learning includes a memory and one or more processors communicatively coupled to the memory. The one or more processors are configured to obtain pharmaceutical data indicating properties of previously-identified pharmaceutical molecules from one or more databases. The pharmaceutical data includes molecular physiochemical data, drug impact data, side effect data, toxicity data, solubility data, or a combination thereof. The one or more processors are also configured to perform NLP on at least a portion of the pharmaceutical data to convert the at least a portion of the pharmaceutical data to training data. The training data includes vectorized representations of the properties of the previously-identified pharmaceutical molecules. The one or more processors are further configured to train one or more ML models based on the training data to configure the one or more ML models to identify additional pharmaceutical molecules. The additional pharmaceutical molecules are distinct from the previously-identified pharmaceutical molecules.

In another particular aspect, a non-transitory computer-readable storage medium stores instructions that, when executed by one or more processors, cause the one or more processors to perform operations for pharmaceutical molecule identification using machine learning. The operations include obtaining pharmaceutical data indicating properties of previously-identified pharmaceutical molecules from one or more databases. The pharmaceutical data includes molecular physiochemical data, drug impact data, side effect data, toxicity data, solubility data, or a combination thereof. The operations also include performing NLP on at least a portion of the pharmaceutical data to convert the at least a portion of the pharmaceutical data to training data. The training data includes vectorized representations of the properties of the previously-identified pharmaceutical molecules. The operations further include training one or more ML models based on the training data to configure the one or more ML models to identify additional pharmaceutical molecules. The additional pharmaceutical molecules are distinct from the previously-identified pharmaceutical molecules.

In the context of the present disclosure the terms “molecule” and “compound” can be used interchangeably. Non-limiting examples of molecules and compounds can include small molecules and biologics. In one non-limiting aspect, small molecules can be chemically derived such as by being manufactured through chemical synthesis or isolated from another material having the small molecule. In one non-limiting aspect, biologics can include a material or substance extracted from, synthesized by, or manufactured from living organisms (e.g., microorganisms, plants, animals, cells, etc.). Non-limiting examples of biologics can include sugars, polymers, peptides, proteins, enzymes, or nucleic acids or combinations thereof.

The foregoing has outlined rather broadly the features and technical advantages of the present disclosure in order that the detailed description that follows may be better understood. Additional features and advantages will be described hereinafter which form the subject of the claims of the disclosure. It should be appreciated by those skilled in the art that the conception and specific aspects disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present disclosure. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the scope of the disclosure as set forth in the appended claims. The novel features which are disclosed herein, both as to organization and method of operation, together with further objects and advantages will be better understood from the following description when considered in connection with the accompanying figures. It is to be expressly understood, however, that each of the figures is provided for the purpose of illustration and description only and is not intended as a definition of the limits of the present disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present invention, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a block diagram of an example of a system for pharmaceutical molecule identification using machine learning according to one or more aspects;

FIG. 2 is a block diagram of another example of a system for pharmaceutical molecule identification using machine learning according to one or more aspects;

FIG. 3 is a flow diagram illustrating an example of a method for identifying pharmaceutical molecules and for identifying uses for pharmaceutical molecules according to one or more aspects; and

FIG. 4 is a flow diagram illustrating an example of a method for pharmaceutical molecule identification using machine learning according to one or more aspects.

It should be understood that the drawings are not necessarily to scale and that the disclosed aspects are sometimes illustrated diagrammatically and in partial views. In certain instances, details which are not necessary for an understanding of the disclosed methods and apparatuses or which render other details difficult to perceive may have been omitted. It should be understood, of course, that this disclosure is not limited to the particular aspects illustrated herein.

DETAILED DESCRIPTION

Aspects of the present disclosure provide systems, methods, and computer-readable storage media for automated identification of molecules or compounds using machine learning for use in pharmaceutical products such as drugs, medicine, remedies, cosmetics, and the like. The techniques described herein support identification, using artificial intelligence and machine learning techniques, of molecules that may not have previously been identified, tested, and/or studied for a given pharmaceutical application (e.g., identification of new molecules for existing or new uses, disease states, or conditions and/or identification of existing molecules for new uses, disease states, or conditions). The artificial intelligence and machine learning techniques described herein may be trained using a variety of pharmaceutical data associated with previously identified drugs, such as physiochemical data (e.g., data indicating the elements and structures of previously-identified molecules), drug impact data, side effect data, toxicity data, solubility data, other drug-related information, and the like, as non-limiting examples. The pharmaceutical data used for training may be obtained from a variety of sources, such as publicly available drug information databases (e.g., the ZINC database), third-party databases (e.g., drug vendor or manufacturer databases, university databases, government agency databases, and the like), proprietary databases, or a combination thereof. Natural language processing may be performed on text data or particularly-formatted data, such as simplified molecular-input line-entry system (SMILES)-formatted data, to generate training data for training generative machine learning model(s) to identify pharmaceutical molecules. Using artificial intelligence and machine learning to identify pharmaceutical molecules based on pharmaceutical data associated with large quantities of previously-identified molecules, some of which are not related to the same type of drug, may result in identification of pharmaceutical molecules that would not be identified by a human (e.g., a chemist or biochemist) using existing drug discovery processes. To illustrate, because the artificial intelligence and machine learning are able to determine underlying similarities between more drugs, many of which may not be apparent to a human, the identified molecules may be more similar to successful drugs, and thus more likely to be useful in producing new drugs, than molecules that are manually identified by a human. Additionally, automated identification of the molecules may be faster than molecule identification by other systems that require substantial user interaction and decision making. By increasing the likelihood of identifying useful molecules in a shorter period of time, the systems and methods described herein may substantially reduce the costs and shorten the development cycle associated with discovering and launching new drugs (e.g., pharmaceuticals). Although described in the context of pharmaceutical products (e.g., drugs), the techniques of the present disclosure may be applied to identify molecules for use in other types of products, such as health products and supplements, personal hygiene products, cosmetic products, biotech products, chemical products, and the like.

Referring to FIG. 1, an example of a system for pharmaceutical molecule identification (e.g., drug discovery) using machine learning according to one or more aspects is shown as a system 100. The system 100 may be configured to train machine learning model(s) to identify “new” pharmaceutical molecules or compounds (e.g., previously-unidentified molecules or compounds for use in drugs or other pharmaceutical products) using information associated with previously-identified pharmaceutical molecules or compounds. In some implementations, the system 100 may use the trained machine learning model(s) to identify one or more pharmaceutical molecules for production and testing. Additionally or alternatively, the trained machine learning model(s) may be provided to other devices, such as client device(s), for use in pharmaceutical molecule identification. As shown in FIG. 1, the system 100 includes a computing device 102, a display device 130, one or more databases 150, a client device 162, a drug production system 164, and one or more networks 170. In some implementations, one or more of the display device 130, the client device 162, or the drug production system 164 may be optional, or the system 100 may include additional components, such as a user device, as a non-limiting example.

The computing device 102 (e.g., a pharmaceutical molecule identification device or a drug identification device) may include or correspond to a desktop computing device, a laptop computing device, a personal computing device, a tablet computing device, a mobile device (e.g., a smart phone, a tablet, a personal digital assistant (PDA), a wearable device, and the like), a server, a virtual reality (VR) device, an augmented reality (AR) device, an extended reality (XR) device, a vehicle (or a component thereof), an entertainment system, other computing devices, or a combination thereof, as non-limiting examples. The computing device 102 includes one or more processors 104, a memory 106, one or more communication interfaces 120, a data processing and transformation engine 122, a training engine 124, one or more machine learning (ML) models 126, and an identification engine 128. It is noted that functionalities described with reference to the computing device 102 are provided for purposes of illustration, rather than by way of limitation, and that the exemplary functionalities described herein may be provided via other types of computing resource deployments. For example, in some implementations, computing resources and functionality described in connection with the computing device 102 may be provided in a distributed system using multiple servers or other computing devices, or in a cloud-based system using computing resources and functionality provided by a cloud-based environment that is accessible over a network, such as one of the one or more networks 170. To illustrate, one or more operations described herein with reference to the computing device 102 may be performed by one or more servers or a cloud-based system that communicates with one or more client or user devices.

The one or more processors 104 may include one or more microcontrollers, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), central processing units (CPUs) having one or more processing cores, or other circuitry and logic configured to facilitate the operations of the computing device 102 in accordance with aspects of the present disclosure. The memory 106 may include random access memory (RAM) devices, read only memory (ROM) devices, erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), one or more hard disk drives (HDDs), one or more solid state drives (SSDs), flash memory devices, network accessible storage (NAS) devices, or other memory devices configured to store data in a persistent or non-persistent state. Software configured to facilitate operations and functionality of the computing device 102 may be stored in the memory 106 as instructions 108 that, when executed by the one or more processors 104, cause the one or more processors 104 to perform the operations described herein with respect to the computing device 102, as described in more detail below. Additionally, the memory 106 may be configured to store data, such as training data 110, one or more identified molecules 112, selected properties 114, and additional training data 116. Exemplary aspects of the training data 110, the identified molecules 112, the selected properties 114, and the additional training data 116 are described in more detail below.

The one or more communication interfaces 120 may be configured to communicatively couple the computing device 102 to the one or more networks 170 via wired or wireless communication links established according to one or more communication protocols or standards (e.g., an Ethernet protocol, a transmission control protocol/internet protocol (TCP/IP), an Institute of Electrical and Electronics Engineers (IEEE) 802.11 protocol, an IEEE 802.16 protocol, a 3rd Generation (3G) communication standard, a 4th Generation (4G)/long term evolution (LTE) communication standard, a 5th Generation (5G) communication standard, and the like). In some implementations, the computing device 102 includes one or more input/output (I/O) devices that include one or more display devices, a keyboard, a stylus, one or more touchscreens, a mouse, a trackpad, a microphone, a camera, one or more speakers, haptic feedback devices, or other types of devices that enable a user to receive information from or provide information to the computing device 102. In some implementations, the computing device 102 is coupled to the display device 130, such as a monitor, a display (e.g., a liquid crystal display (LCD) or the like), a touch screen, a projector, a virtual reality (VR) display, an augmented reality (AR) display, an extended reality (XR) display, or the like. Although shown as external to the computing device 102 in FIG. 1, in some other implementations, the display device 130 is included in or integrated in the computing device 102.

The data processing and transformation engine 122 may be configured to obtain pharmaceutical data 132 from the databases 150 and to process, filter, or otherwise transform the pharmaceutical data 132 for use by other components of the computing device 102. For example, the pharmaceutical data 132 may indicate properties of previously-identified pharmaceutical molecules, and the data processing and transformation engine 122 may process and otherwise convert the pharmaceutical data 132 (or a portion thereof) to a common format that may be used for analysis and training data generation. To illustrate, the data processing and transformation engine 122 may be configured to perform one or more pre-processing operations, one or more formatting operations, one or more conversion operations, one or more filtering operations, or a combination thereof, on the pharmaceutical data 132 to convert the pharmaceutical data 132 to a target format, to reduce a size or complexity of the pharmaceutical data 132, to eliminate particular values that do not provide sufficient information, to add in missing values, or a combination thereof.
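As a non-limiting illustration of such pre-processing, the following minimal sketch filters out records that lack a structure string, fills missing numerical entries with the median of the observed values, and removes outliers. The record layout, the field names ("smiles", "mol_weight"), the median fill, and the median-absolute-deviation outlier rule are assumptions chosen for illustration; the disclosure does not specify particular techniques.

```python
from statistics import median

# Illustrative sketch only: field names and the specific rules are assumptions.

def preprocess(records, field, k=5.0):
    """Filter, fill missing values, and drop outliers for one numeric field."""
    # Filtering: keep only records that carry a non-empty SMILES string.
    kept = [dict(r) for r in records if r.get("smiles")]

    # Filling in missing entries: use the median of the observed values.
    observed = [r[field] for r in kept if r.get(field) is not None]
    fill_value = median(observed)
    for r in kept:
        if r.get(field) is None:
            r[field] = fill_value

    # Outlier removal: drop values more than k median absolute deviations
    # away from the median of the field.
    center = median(r[field] for r in kept)
    mad = median(abs(r[field] - center) for r in kept)
    if mad == 0:
        return kept
    return [r for r in kept if abs(r[field] - center) <= k * mad]


raw_records = [
    {"smiles": "CCO", "mol_weight": 46.07},
    {"smiles": "CC(=O)O", "mol_weight": 60.05},
    {"smiles": "c1ccccc1", "mol_weight": None},   # missing entry
    {"smiles": "", "mol_weight": 18.02},          # no structure string
    {"smiles": "CCN", "mol_weight": 45.08},
    {"smiles": "CCCC", "mol_weight": 5800.0},     # implausible value
]
cleaned = preprocess(raw_records, "mol_weight")
```

The function copies each record before modifying it, so the original pharmaceutical data is left unchanged and may be reprocessed with different settings.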

The training engine 124 may be configured to generate the training data 110 based on the processed pharmaceutical data 132. For example, the training engine 124 may extract a particular set of features from the pharmaceutical data 132 and group the extracted features, such as in one or more vectors, to generate the training data 110. In some implementations, the particular set of features are determined based on feature analysis of the pharmaceutical data 132 and are predetermined for all types of molecule identification, or the particular set of features may be based on a type of molecule to be identified, a particular disease or condition for which molecules are to be identified, particular properties associated with identified molecules, user input, or a combination thereof. To extract the features, the training engine 124 may be configured to extract numerical features from numerical data, to extract categorical features from text or numerical data and convert the categorical features to numerical features, to perform natural language processing (NLP) on text data to convert text features into numerical features, or a combination thereof. In some implementations, the training engine 124 may be configured to scale or otherwise transform extracted features to a format that is useable to train ML models. After extracting the features, the training engine 124 may group or otherwise format the extracted features, such as performing vectorization on the extracted features, to generate the training data 110.
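As a non-limiting illustration of the feature extraction, scaling, and vectorization described above, the following minimal sketch scales numerical features to the range 0 to 1, one-hot encodes categorical features, and concatenates the results into one vector per molecule. The field names, the min-max scaling, and the helper functions are assumptions made for illustration and are not specified by the disclosure.

```python
# Illustrative sketch only: field names and scaling choice are assumptions.

def min_max_scale(values):
    """Scale a list of numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [0.0 if span == 0 else (v - lo) / span for v in values]


def one_hot(value, categories):
    """Encode one categorical value against an ordered list of categories."""
    return [1.0 if value == c else 0.0 for c in categories]


def vectorize(records, numeric_fields, categorical_fields):
    """Return one feature vector per record: scaled numerics, then one-hot categoricals."""
    scaled = {f: min_max_scale([r[f] for r in records]) for f in numeric_fields}
    cats = {f: sorted({r[f] for r in records}) for f in categorical_fields}
    vectors = []
    for i, r in enumerate(records):
        vec = [scaled[f][i] for f in numeric_fields]
        for f in categorical_fields:
            vec.extend(one_hot(r[f], cats[f]))
        vectors.append(vec)
    return vectors


records = [
    {"mol_weight": 100.0, "log_solubility": -2.0, "toxicity": "low"},
    {"mol_weight": 300.0, "log_solubility": -4.0, "toxicity": "high"},
    {"mol_weight": 200.0, "log_solubility": -3.0, "toxicity": "low"},
]
training_data = vectorize(records, ["mol_weight", "log_solubility"], ["toxicity"])
```

Every resulting vector has the same length and the same column order, which is what allows the vectors to be grouped into a single training set.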

After generating the training data 110, the training engine 124 may be configured to train the one or more ML models 126 that are accessible to the training engine 124 (e.g., via storage at the memory 106 or other storage devices) based on the training data 110. The one or more ML models 126 may be trained to identify “new” molecules (e.g., previously-unidentified molecules) based on properties of previously-identified pharmaceutical molecules indicated by the training data 110. As used herein, new or previously-unidentified pharmaceutical molecules encompass small molecules and/or biologics that may have a therapeutic effect such that the molecule may be used as an ingredient in a drug or other medicinal product (e.g., pharmaceuticals). In some non-limiting aspects, the molecules may include multiple atoms of the same element or compounds (e.g., molecules made of atoms from different elements). Such previously-unidentified molecules may include different combinations of elements than the previously-identified molecules, different structures of known combinations of elements, or different combinations of elements and different structures than the previously-identified molecules. Additionally or alternatively, the previously-unidentified molecules may include molecules that have not previously been identified as having a pharmaceutical effect. In some implementations, the one or more ML models 126 may be trained to identify molecules having particular properties (e.g., physiochemical structures, toxicity, solubility, etc.) based on the training data 110, such as by weighting or labeling training data based on relationships to the particular properties.

In some implementations, the one or more ML models 126 (referred to herein as the ML models 126) may include a single ML model or multiple ML models configured to identify molecules. In some implementations, the ML models 126 may include or correspond to generative ML models. For example, the ML models 126 may include generative adversarial networks (GANs), such as multi-objective GANs, objective-reinforced GANs, conditional deep GANs, and the like, variational autoencoders (VAEs), such as standard VAEs, multi-objective VAEs, and the like, or a combination thereof. Generative modeling is an unsupervised learning task that involves automatically discovering and learning patterns or relationships in input data in such a way that a model can be used to generate or output new examples that plausibly could have been drawn from the input data set. GANs can be used to frame the problem as a supervised learning problem with two sub-models: a generator model that is trained to generate new examples, and a discriminator model that is trained to classify examples as either real (e.g., from the input data set) or fake (e.g., from the generator model). The two models, typically convolutional neural networks, are trained together in a zero-sum game, until the discriminator is fooled by the generator a particular percentage of the time. VAEs may be configured to learn efficient data codings in an unsupervised manner, such as by encoding higher-dimensionality input data as probability distributions of latent variables, and decoding the probability distributions of the latent variables to create slightly different versions of the input data. In some implementations, the ML models 126 (e.g., the GANs, the VAEs, or both) may be implemented as neural networks. In other implementations, the ML models 126 may be implemented as other types of ML models or constructs, such as decision trees, random forests, regression models, Bayesian networks (BNs), dynamic Bayesian networks (DBNs), naive Bayes (NB) models, Gaussian processes, hidden Markov models (HMMs), and the like.
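The adversarial training signal described above can be illustrated with a minimal sketch. It assumes the discriminator outputs a probability that an example is real, uses binary cross-entropy for the discriminator and the common non-saturating form for the generator, and treats a 0.5 decision threshold as the point at which the discriminator is "fooled". The function names and the threshold are illustrative assumptions, not details taken from the disclosure.

```python
import math

EPS = 1e-12  # keeps log() finite when a probability is exactly 0 or 1


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: push outputs on real data to 1 and on fakes to 0."""
    real_term = sum(-math.log(p + EPS) for p in d_real) / len(d_real)
    fake_term = sum(-math.log(1.0 - p + EPS) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating generator loss: small when fakes are scored as real."""
    return sum(-math.log(p + EPS) for p in d_fake) / len(d_fake)


def fooled_rate(d_fake, threshold=0.5):
    """Fraction of generated examples that the discriminator labels as real."""
    return sum(1 for p in d_fake if p > threshold) / len(d_fake)


# Hypothetical discriminator outputs for one batch.
d_real = [0.9, 0.8, 0.95]
d_fake = [0.2, 0.6, 0.4, 0.7]
print(discriminator_loss(d_real, d_fake), generator_loss(d_fake), fooled_rate(d_fake))
```

A training loop could stop once `fooled_rate` reaches a chosen target percentage, which is one possible reading of the stopping condition described above.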

The identification engine 128 may be configured to identify the identified molecules 112 that are distinct from the previously-identified molecules. For example, the identification engine 128 may provide input data to the ML models 126 to cause the ML models 126 to identify pharmaceutical molecules that may have an underlying similarity to molecules indicated by the input data, and therefore may be more likely than random molecules to have a pharmaceutical effect or to exhibit particular properties. To generate the input data, the identification engine 128 may sample previously-identified molecules used as ingredients in drugs to remedy a particular disease or condition, previously-identified molecules that exhibit particular properties, a random sampling of previously-identified molecules, or a sampling based on other parameters.

The databases 150 may include one or more databases, or other storage devices, configured to maintain and provide access to stored pharmaceutical data. The databases 150 may include publicly available drug information databases (e.g., databases maintained by information or standards organizations or government agencies such as the Food and Drug Administration (FDA), the Centers for Disease Control and Prevention (CDC), and the like), third-party drug information databases (e.g., databases maintained by pharmaceutical vendors or researchers, universities, and the like), proprietary databases (e.g., databases maintained by an entity that operates the computing device 102), other databases, or a combination thereof. Particular, non-limiting examples of publicly available or accessible databases include the ZINC database (“Zinc15” or “Zinc12”), the ChEMBL database, the PubChem database, the SIDER database (“SIDER Side Effect Resource”), the Binding Database (“BindingDB”), the DrugBank database (“DrugBank Online”), and the like.

The databases 150 are configured to store the pharmaceutical data 132 that indicates properties, such as physiochemical properties, efficacy, human interactions, side effects, and the like, associated with multiple previously-identified pharmaceutical molecules. In some implementations, the databases 150 are configured to store (e.g., the pharmaceutical data 132 includes) physiochemical data 152, drug impact data 154, side effect data 156, toxicity data 158, solubility data 160, other pharmaceutical data, or a combination thereof. The physiochemical data 152 may indicate physiochemical structures of the previously-identified molecules, such as the elements, and shape or structure of the elements, that form the previously-identified molecules. In some implementations, at least a portion of the physiochemical data 152 may be formatted in accordance with the simplified molecular-input line-entry system (SMILES). SMILES is a line notation for describing the structure of chemical species using short ASCII strings that include letters and numbers indicating elements (and their respective quantity) and other symbols (e.g., -, =, #, $, :, /, \) that represent different types of bonds between the elements. As an example, a molecule of carbon dioxide may be represented as O=C=O in the SMILES notation. The SMILES notation is designed such that a molecule represented by a SMILES notation can be easily converted to a two-dimensional or three-dimensional model of the respective molecule.
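As an illustration of how a SMILES string breaks down into element and bond symbols, the following sketch tokenizes a string with a regular expression. It is a simplified, assumed tokenizer covering bracket atoms, the organic-subset elements, aromatic atoms, bond symbols, ring-closure labels, and branches. It does not validate chemistry and is not the parser used by any particular database or toolkit.

```python
import re
from collections import Counter

# Simplified token pattern (an assumption for illustration only).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"      # bracket atom, e.g. [NH4+]
    r"|Br|Cl"          # two-letter organic-subset elements
    r"|[BCNOPSFI]"     # one-letter organic-subset elements
    r"|[bcnops]"       # aromatic atoms
    r"|[-=#$:/\\]"     # bond symbols
    r"|%\d{2}|\d"      # ring-closure labels
    r"|[().]"          # branches and disconnected parts
)


def tokenize(smiles):
    """Split a SMILES string into tokens, rejecting unrecognized characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens


def count_tokens(smiles):
    """Count how often each token (element symbol, bond symbol, ...) occurs."""
    return Counter(tokenize(smiles))


print(tokenize("O=C=O"))      # carbon dioxide
print(count_tokens("O=C=O"))
```

Token counts of this kind are one simple way to turn a structure string into numerical features. A cheminformatics toolkit would normally be used for anything beyond illustration.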

The drug impact data 154 may indicate impacts and effects of the previously-identified molecules on the human body (e.g., on a recipient of the drug), such as changes experienced with respect to symptoms of a disease or condition, effects on functioning of the body, effects on organs or other body parts, and the like. The side effect data 156 may indicate side effects associated with the previously-identified molecules on the human body, such as effects on functioning of the body or parts of the body that are unrelated to treatment of the disease or condition. The toxicity data 158 may indicate measurements of the toxic effects of the previously-identified drugs on the human body, such as the LD₅₀ (e.g., the median lethal dose), as a non-limiting example. The solubility data 160 may indicate measurements of the solubility of the previously-identified drugs in solvents, such as water and organic solvents (methanol, ethanol, propanol, acetone, ethyl acetate, hexane, heptane, dichloromethane, tetrahydrofuran, acetonitrile, dimethylformamide, toluene, dimethyl sulfoxide, etc.), or a combination thereof. The above-described types of pharmaceutical data are not intended to be limiting, and in other implementations, other types of pharmaceutical data may be stored by the databases 150, such as affinity data, selectivity data, efficacy data, metabolic stability data, oral bioavailability data, and the like.

The client device 162 may include or correspond to a computer device used by a client of the entity that operates the computing device 102 to perform molecule identification (e.g., drug discovery). For example, the client device 162 may be operated by a pharmaceutical company, a university, a research institution, or the like, that is engaged in drug discovery. The client device 162 may include or correspond to a computing device, such as a desktop computer or a laptop computer, a server, a mobile device (e.g., a smart phone, a tablet computer, a wearable device, a personal digital assistant (PDA), or the like), an audio/visual device, an entertainment device, a control device, a vehicle (or a component thereof), a VR device, an AR device, an XR device, or the like. The client device 162 may be configured to receive the trained ML models 126 (or configuration data associated with the trained ML models 126) for use in the drug discovery process.

The drug production system 164 may include one or more pieces of automated or semi-automated equipment, or one or more devices, configured to perform operations of drug formulation. For example, the drug production system 164 may include or correspond to agitators, blowers, boilers, centrifuges, chillers, cooling towers, dryers, homogenizers, mixers, ovens, and the like. Components of the drug production system 164 may include processors, memories, interfaces, motors, sensors, and the like that are configured to enable fully or semi-automated performance of one or more operations, in addition to communication with other components of the drug production system 164 or other devices. In some implementations, the drug production system 164 may be configured to receive instructions from the computing device 102 for initiating one or more operations.

During operation of the system 100, the computing device 102 may obtain the pharmaceutical data 132 from the databases 150. For example, the computing device 102 may query the databases 150 and receive the pharmaceutical data 132 (or a portion thereof). As another example, the computing device 102 may manually pull the pharmaceutical data 132 (or a portion thereof) from the databases 150 using one or more pull commands. As another example, the computing device 102 may extract the pharmaceutical data 132 (or a portion thereof) from websites or other publicly accessible information displays that are supported by the databases 150, such as using a crawler or other data mining techniques. As described above, the pharmaceutical data 132 may indicate properties of multiple previously-identified pharmaceutical molecules. In some implementations, the pharmaceutical data 132 may include the physiochemical data 152, the drug impact data 154, the side effect data 156, the toxicity data 158, the solubility data 160, other types of pharmaceutical data, or a combination thereof.

The data processing and transformation engine 122 may process and transform the pharmaceutical data 132, such as transforming different types of data included in the pharmaceutical data 132 to a common format or type. In some implementations, the data processing and transformation engine 122 may perform pre-processing on the pharmaceutical data 132. Performing the pre-processing may reduce complexity of feature extraction to be performed on the pharmaceutical data 132, reduce the memory footprint associated with the pharmaceutical data 132, clean up the pharmaceutical data 132, format the pharmaceutical data 132, or a combination thereof. For example, the pre-processing may include performing statistical analysis on the pharmaceutical data 132 to remove or modify an outlier from the pharmaceutical data 132, removing an entry from the pharmaceutical data 132 that is associated with a variance that fails to satisfy a variance threshold, formatting the pharmaceutical data 132, approximating a missing entry of the pharmaceutical data 132 (e.g., using interpolation or other statistical modeling techniques), other pre-processing operations, or a combination thereof. Additionally or alternatively, the data processing and transformation engine 122 may perform dimensionality reduction on the pharmaceutical data 132 (or extracted features) to reduce a memory footprint associated with the pharmaceutical data 132 and to reduce processing complexity of the feature extraction performed by the training engine 124. The dimensionality reduction may project the pharmaceutical data 132 onto a lower-dimension feature space, such as by principal component analysis, singular value decomposition, or the like.
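Some of the pre-processing operations named above can be sketched with the Python standard library alone. The sketch assumes a z-score rule for outliers, linear interpolation for missing entries, and a per-feature reading of the variance threshold (features that barely vary are dropped). The cut-off values and function names are illustrative assumptions. Dimensionality reduction is omitted because it would normally rely on a numerical library.

```python
import statistics


def remove_outliers(values, z_max=3.0):
    """Drop entries more than z_max population standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return list(values)
    return [v for v in values if abs(v - mean) / std <= z_max]


def fill_missing(values):
    """Linearly interpolate interior None entries; copy the nearest value at the ends."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no known values to interpolate from")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            t = (i - left) / (right - left)
            filled[i] = values[left] + t * (values[right] - values[left])
    return filled


def drop_low_variance(columns, min_variance=1e-8):
    """Keep only named feature columns whose population variance meets the threshold."""
    return {name: col for name, col in columns.items()
            if statistics.pvariance(col) >= min_variance}


print(fill_missing([1.0, None, 3.0]))
print(drop_low_variance({"constant": [1, 1, 1], "varying": [1, 2, 3]}))
```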

The training engine 124 may generate the training data 110 based on the processed pharmaceutical data 132 from the data processing and transformation engine 122. Generating the training data 110 may include extracting a predetermined set of features from the pharmaceutical data 132, which may include performing one or more operations to convert the pharmaceutical data 132 to a different type of data from which features that are acceptable to the ML models 126 may be extracted. In some implementations, the training engine 124 may extract numerical features from the pharmaceutical data 132. For example, the numerical features may include toxicity measurements, solubility measurements, atomic weights, or the like. The training engine 124 may scale or otherwise transform the extracted numerical features, such as performing a normalization transformation, a standardization transformation, a power transformation, a quantile transformation, or a combination thereof, on the extracted numerical features. Additionally or alternatively, the training engine 124 may extract numerical features from non-numerical features in the pharmaceutical data 132. As an example, the training engine 124 may convert categorical features or binary features to integer values, such as ‘1’ or ‘0’ for ‘yes’ and ‘no,’ respectively, or create integer values from multiple different categories, such as using a one-hot encoding. As another example, the training engine 124 may perform natural language processing (NLP) on text data of the pharmaceutical data 132 to convert the text data into numerical features. The NLP may include tokenization, removing stop words, stemming, lemmatization, bag of words processing, other NLP, or a combination thereof. In some implementations, at least a portion of the pharmaceutical data 132, such as the physiochemical data 152, may be SMILES-formatted text data. For example, physiochemical structures of the previously-identified molecules may be represented by strings of characters according to the SMILES notation, such as O=C=O for carbon dioxide. In such implementations, the training engine 124 may perform NLP on the SMILES-formatted strings to convert the SMILES-formatted strings to numerical features, such as numbers of various elements, numbers of various types of bonds, correspondence between the bonds and the elements, etc. As another example, the training engine 124 may perform NLP on text data included in the pharmaceutical data 132 to extract numerical features corresponding to other textual information, such as drug impact or efficacy information associated with the previously-identified molecules, side effects associated with the previously-identified molecules, and the like. After extracting the features, the training engine 124 may vectorize or otherwise group the extracted features to a format that may be processed by the ML models 126 to generate the training data 110.
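The scaling and encoding steps described above can be sketched as follows. The example assumes min-max normalization, z-score standardization, one-hot encoding over a fixed category list, and a bag-of-words count over a fixed vocabulary. The record fields, category names, vocabulary, and stop-word list are hypothetical placeholders, not values from any data source named in the disclosure.

```python
import statistics


def min_max_normalize(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and scale values to zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over a fixed category list."""
    return [1 if value == c else 0 for c in categories]


def bag_of_words(text, vocabulary, stop_words=frozenset({"and", "the", "of"})):
    """Count vocabulary words in whitespace-tokenized, lower-cased text."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return [tokens.count(word) for word in vocabulary]


# Hypothetical record: one binary flag, one category, one free-text note.
flag = 1 if "yes" == "yes" else 0
vector = [flag] + one_hot("oral", ["oral", "topical", "injected"]) \
    + bag_of_words("Headache and nausea", ["nausea", "headache", "rash"])
print(vector)
print(min_max_normalize([1, 2, 3]), standardize([1, 2, 3]))
```

Concatenating the encoded pieces into one flat list, as in `vector`, is one simple form of the vectorization step mentioned at the end of the paragraph.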

After generating the training data 110, the training engine 124 may train the ML models 126 to identify the identified molecules 112 based on the training data 110. In some implementations, training the ML models 126 may include segmenting the training data 110 into a training set and a test set. The training engine 124 may provide the training set to the ML models 126 to train the ML models 126 to identify molecules (e.g., the identified molecules 112) based on underlying similarities between the previously-identified molecules that are derived from the training set. In addition to, or as part of the training, the training engine 124 may adjust one or more parameters or hyper-parameters associated with the ML models 126. In some implementations, the training engine 124 may train the ML models 126 to identify the identified molecules 112 that have (or are predicted or likely to have) particular properties. To illustrate, the computing device 102 may obtain the selected properties 114 and generate the training data 110 and train the ML models 126 such that the identified molecules 112 have (or are predicted to have) the selected properties 114. In some implementations, the computing device 102 may receive the selected properties 114 from an I/O device or a user device that receives user input indicating the selected properties 114. Additionally or alternatively, the computing device 102 may determine the selected properties 114 based on a target disease or condition for which molecules are to be identified. As an example, the selected properties 114 may include particular physiochemical structures (e.g., particular elements, particular types of bonds, or the like), a particular solubility, lack of a particular side effect, or the like. To train the ML models 126 to identify molecules having the selected properties 114, the training engine 124 may assign greater weighting values to portions of the training data 110 that are associated with previously-identified molecules that have the selected properties 114 than to portions of the training data 110 that are associated with previously-identified molecules that do not have the selected properties 114, as a non-limiting example.
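The segmentation and weighting described above might look like the following sketch. It assumes each training record carries a set of property labels, that a record "has the selected properties" when it carries all of them, and that matching records simply receive a larger constant weight. The record layout, the property labels, and the `boost` factor are hypothetical.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle records reproducibly and hold out a fraction as the test set."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def sample_weights(records, selected, boost=5.0):
    """Give a larger weight to records that carry every selected property."""
    return [boost if selected <= rec["properties"] else 1.0 for rec in records]


# Hypothetical records; identifiers and property labels are placeholders.
records = [
    {"id": f"mol_{i}", "properties": {"soluble"} if i % 2 == 0 else {"insoluble"}}
    for i in range(10)
]
train, test = train_test_split(records)
weights = sample_weights(train, selected={"soluble"})
print(len(train), len(test), weights)
```

The resulting weights could then be passed to whatever per-sample weighting mechanism the chosen training routine supports.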

In some implementations, after the training engine 124 trains the ML models 126, the identification engine 128 may access the ML models 126 to identify one or more previously-unidentified molecules (e.g., the identified molecules 112). The computing device 102 may generate an output 134 that indicates the identified molecules 112. The output 134 may be displayed to a user, provided to another device, or used to initiate performance of one or more operations. As an example, the computing device 102 may provide the output 134 to the display device 130 to cause the display device 130 to display a graphical user interface (GUI). The GUI may include text indicating the identified molecules 112 (e.g., names of the identified molecules 112, SMILES strings indicating the physiochemical structure of the identified molecules 112, and the like), visual representations of the identified molecules (e.g., 2D or 3D representations of the molecular structure), other text or multimedia content representing the identified molecules 112, or a combination thereof. Additionally or alternatively, the GUI may include text, graphical, or multimedia content that indicates properties of the identified molecules 112, such as a list of side effects, solubility measurements, toxicity measurements, likely impacted organs, and the like, and/or comparisons of the properties of the identified molecules 112 to properties of previously-identified molecules, such as graphs, charts, or the like. As another example, the computing device 102 may provide the output 134 to another device, such as the client device 162 or a user device. As another example, the computing device 102 may provide the output 134 to the drug production system 164 to initiate performance of one or more operations at the drug production system 164. To illustrate, the output 134 may include or correspond to one or more instructions that cause the drug production system 164 to perform one or more operations to facilitate formation of the identified molecules 112. For example, the one or more instructions may initiate mixing of chemicals in a mixer, activating a heater or a cooler to change a state of a chemical, retrieving of one or more samples from a vault or other storage location, or the like.

Additionally or alternatively, the computing device 102 may provide the trained ML models 126 to the client device 162. For example, after training the ML models 126, the computing device 102 may generate configuration information that indicates the parameters, the hyper-parameters, and any other configuration of the trained ML models 126, and the computing device 102 may provide the configuration information to the client device 162 to enable the client device 162 to implement the trained ML models 126 at the client device 162 for identifying molecules as part of drug discovery performed at the client device 162. In some implementations, the computing device 102 may be configured to train the ML models 126 but not to perform molecule identification, instead leaving the molecule identification to be performed by the client device 162. In such implementations, the computing device 102 does not include the identification engine 128.
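One way to picture the configuration information exchanged between the two devices is as a serialized document holding the model type, hyper-parameters, and learned parameters. The sketch below assumes JSON as the wire format and uses made-up field names and values. The disclosure does not specify a format, so all of these are illustrative assumptions.

```python
import json

REQUIRED_KEYS = {"model_type", "hyper_parameters", "parameters"}


def export_config(model_type, hyper_parameters, parameters):
    """Serialize a trained model's configuration for transfer to a client."""
    return json.dumps(
        {"model_type": model_type,
         "hyper_parameters": hyper_parameters,
         "parameters": parameters},
        sort_keys=True,
    )


def import_config(payload):
    """Parse a received configuration and check that nothing required is missing."""
    config = json.loads(payload)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"configuration is missing: {sorted(missing)}")
    return config


# Hypothetical values for a small model.
payload = export_config(
    model_type="vae",
    hyper_parameters={"latent_dim": 2, "learning_rate": 0.001},
    parameters={"decoder.bias": [0.1, -0.2]},
)
restored = import_config(payload)
print(restored["hyper_parameters"]["latent_dim"])
```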

In some implementations, the training engine 124 may further train the ML models 126 based on results associated with the identified molecules 112. To illustrate, the training engine 124 may receive testing data associated with tests of the identified molecules 112, and the training engine 124 may generate the additional training data 116 based on the testing data and the identified molecules 112 using the techniques described above for the training data 110. For example, the testing data may indicate properties of the identified molecules 112, such as solubility or toxicity of the identified molecules 112, as well as observations from clinical testing of drugs formed from the identified molecules, such as observed success (or failure) in treating a particular disease or condition, effects on the patients, side effects experienced by the patients, and the like. In this manner, the ML models 126 may be dynamically updated to improve the utility of molecules identified by the ML models 126 based on new information.

As described above, the system 100 supports training of the ML models 126 to automatically identify the identified molecules 112 (e.g., pharmaceutical molecules that have not been previously identified). Using artificial intelligence and machine learning to identify the identified molecules 112 based on the pharmaceutical data 132, which may be associated with large quantities of previously-identified pharmaceutical molecules used as ingredients in related and unrelated drugs, may result in identification of a wider variety of new pharmaceutical molecules (e.g., previously-unidentified pharmaceutical molecules). At least some of these molecules would not be identified by a human drug expert (e.g., a chemist or biochemist) using existing drug discovery processes. To illustrate, the ML models 126 may be trained to identify the identified molecules 112 based on underlying similarities between multiple previously-identified molecules, and many of these underlying similarities may not be apparent to the human drug expert. Thus, the identified molecules 112 may be more similar to successful drugs (even if the drugs are not used to treat the same disease or condition), and therefore are more likely to be useful in producing new drugs than molecules that are manually identified by the human drug expert. The ML models 126 may also be trained based on testing results associated with the identified molecules 112 to improve the quality of the molecule identification performed by the ML models 126. Additionally, automated identification of the identified molecules 112 by the ML models 126 may be faster than molecule identification by other systems that require substantial user interaction and decision making by the human drug expert. By increasing the likelihood of identifying useful molecules in a shorter period of time, the system 100 may substantially reduce the costs and shorten the development cycle associated with discovering and launching new drugs.

Referring to FIG. 2, another example of a system for pharmaceutical molecule identification using machine learning according to one or more aspects is shown as a system 200. In some implementations, the system 200 may include or correspond to the system 100 of FIG. 1. As shown in FIG. 2, the system 200 (also referred to as a drug discovery platform) includes data sources 202, a data import layer 210, a data storage layer 220, a data transformation layer 230, an artificial intelligence/machine learning (AI/ML) engine 240, an access layer 250, an application programming interface (API) management layer 260, other device 270, and a message orchestration and logging layer 280.

The data sources 202 include multiple data sources, such as databases, for accessing pharmaceutical data for use in training ML models to identify pharmaceutical molecules. In the particular implementation illustrated in FIG. 2, the data sources 202 may include a drug bank 204, the ZINC database 206, a binding database 208, and the ChEMBL database 209. In other implementations, the data sources 202 may include other data sources, such as other publicly available databases, third-party databases, proprietary databases, or a combination thereof, as further described with reference to FIG. 1. The drug bank 204 may include a database of drugs, such as those released by an operator or client of the system 200, or a third party. The drug bank 204 may store information associated with the drugs, such as physiochemical structures of molecules used as ingredients, efficacy data, side effects associated with the drugs, and the like. The ZINC database 206 is a publicly available database that maintains pharmaceutical data for multiple pharmaceutical molecules. For example, the ZINC database 206 may store SMILES-formatted ligand structures, molecular weights, partition coefficients (Log P values), druglikeness metrics (QED values), molecular ring structure data, hydrogen bond (H-bond) donor and acceptor data, target class data, and the like. The binding database 208 may store data that indicates binding information for pharmaceutical molecules to various proteins, which is useful in screening the newly identified molecules for their success in treating different diseases or conditions, as further described herein with reference to FIG. 3. For example, the binding database 208 may store ligand names, SMILES-formatted ligand structures, target names, half maximal inhibitory concentration (IC₅₀) values, and the like. The ChEMBL database 209 is a publicly available database that maintains pharmaceutical data for multiple pharmaceutical molecules, similar to the ZINC database 206.

The data import layer 210 may be configured to import (e.g., obtain) pharmaceutical data from the data sources 202 for use as training data. The data import layer 210 may be configured to request and receive the pharmaceutical data from the data sources 202, to extract the pharmaceutical data from information supported by the data sources 202, to pull the pharmaceutical data from the data sources 202, or a combination thereof. For example, the data import layer 210 may include Python scripts 212, a crawler 214, and manual pull logic 216. The Python scripts 212 may be executable scripts in Python (or another scripting language) that, when executed by the data import layer 210, cause the data import layer 210 to request and/or query the data sources 202 for various pharmaceutical data. In some implementations, the Python scripts 212 may be configured to interact with one or more application programming interfaces (APIs) of the data sources 202 to receive the pharmaceutical data. The crawler 214 may include or correspond to a web crawler, or other data mining application, that is configured to extract pharmaceutical data from websites (or other sources) that are supported by the data sources 202. The manual pull logic 216 may be configured to perform one or more pull operations with respect to the data sources 202 to retrieve pharmaceutical data.

The data storage layer 220 may be configured to store the imported (e.g., obtained) pharmaceutical data from the data sources 202. For example, the data storage layer 220 may store the pharmaceutical data as one or more datasets, such as a first dataset 222, a second dataset 224, and a third dataset 226, as shown in FIG. 2. In other implementations, the pharmaceutical data may be stored as fewer than three datasets or more than three datasets. The datasets 222-226 may correspond to different types of data (e.g., physiochemical data, side effects data, toxicity data, etc.), different types of drugs or targeted diseases, different properties (e.g., particular molecular structures, particular solubilities, etc.), or may be segregated in other manners. In some implementations, the datasets 222-226 may be stored at one or more cloud storage locations for further analysis and retained in different source folders for downstream component analysis.

The data transformation layer 230 may be configured to pre-process and transform the stored pharmaceutical data (e.g., the datasets 222-226) into a format that can be used as training data for ML models. The data transformation layer 230 may include a first data flow 232, custom Python scripts 234, and a second data flow 236. In other implementations, the data transformation layer 230 may include a single data flow or more than two data flows, different types of scripts for processing and transforming data, or a combination thereof. The first data flow 232 and the second data flow 236 may correspond to particular datasets, such as the first dataset 222 and the second dataset 224, respectively. The custom Python scripts 234 may be configured to perform pre-processing operations, transformation operations, feature extraction operations, training data generation operations, or a combination thereof. For example, the custom Python scripts 234 may be configured to perform statistical analysis on the pharmaceutical data to remove or modify an outlier from the pharmaceutical data, remove an entry from the pharmaceutical data that is associated with a variance that fails to satisfy a variance threshold, format the pharmaceutical data, approximate a missing entry of the pharmaceutical data (e.g., using interpolation or other statistical modeling techniques), perform other pre-processing operations, or a combination thereof. Additionally or alternatively, the custom Python scripts 234 may be configured to perform dimensionality reduction on the pharmaceutical data to reduce a memory footprint associated with the pharmaceutical data and to reduce processing complexity of the feature extraction. The dimensionality reduction may project the pharmaceutical data onto a lower-dimension feature space, such as by principal component analysis, singular value decomposition, or the like. The custom Python scripts 234 may be configured to extract numerical features from the processed pharmaceutical data, or perform operations to convert text data to numerical features. For example, the custom Python scripts 234 may perform NLP on text data to convert the text data into numerical features. The NLP may include tokenization, removing stop words, stemming, lemmatization, bag of words processing, other NLP, or a combination thereof. After extracting the features, the custom Python scripts 234 may vectorize or otherwise group the extracted features to a format that may be processed by ML models to generate training data.

The AI/ML engine 240 may be configured to train and support one or more ML models to identify new (e.g., previously-unidentified) pharmaceutical molecules. In some implementations, the AI/ML engine 240 may support one or more multi-objective GANs 242, one or more objective-reinforced GANs 244, one or more conditional deep GANs 246, one or more VAEs 248, and one or more multi-objective VAEs 249. In other implementations, the AI/ML engine 240 may support fewer ML models, more ML models, or different ML models. In some implementations, the one or more multi-objective GANs 242, the one or more objective-reinforced GANs 244, the one or more conditional deep GANs 246, the one or more VAEs 248, the one or more multi-objective VAEs 249, or a combination thereof, may be implemented using neural networks (e.g., convolutional neural networks, deep neural networks, neural networks with hidden layers, and the like). In some other implementations, the one or more multi-objective GANs 242, the one or more objective-reinforced GANs 244, the one or more conditional deep GANs 246, the one or more VAEs 248, the one or more multi-objective VAEs 249, or a combination thereof, may be implemented using other types of ML models or structures, such as decision trees, random forests, regression models, BNs, DBNs, NB models, Gaussian processes, HMMs, and the like.

The AI/ML engine 240 may be configured to receive training data from the data transformation layer 230 and to provide the training data to the ML models 242-249 to train the ML models 242-249 to identify pharmaceutical molecules, as described with reference to FIG. 1. In some implementations, training the ML models 242-249 may include keeping aside a portion of the received data as test data to test performance of the trained ML models (e.g., to identify whether additional training should be performed, or to identify which of multiple ML models performs the best). Additionally or alternatively, the AI/ML engine 240 may be configured to train the ML models 242-249 to identify pharmaceutical molecules having (or predicted to have) one or more selected properties. For example, one or more properties, such as a particular physiochemical structure, a particular molecular weight, a particular toxicity, a particular side effect (or lack thereof), or the like, may be selected and the training data may be generated to enable training for identification of pharmaceutical molecules having (or predicted to have) the properties, such as by weighting portions of the training data that correspond to the properties or using other techniques. The selected properties may be indicated by user input, or determined based on a particular starting molecule, a particular target disease, or other parameters.

In some implementations, different ML models of the ML models 242-249 may be trained differently (e.g., using different training data) than others of the ML models 242-249. For example, the multi-objective GANs 242 may be trained using multiple discriminators each associated with its own loss function using multiple objective optimization techniques, such as multiple gradient descent (MGD), hypervolume maximization (HVM), or the like. Similar training may be performed for the multi-objective VAEs 249. As another example, the objective-reinforced GANs 244 may be trained using reinforcement learning (RL) to bias the objective-reinforced GANs 244 to achieving particular metrics (e.g., objectives). As another example, the conditional deep GANs 246 may be trained to identify particular labels of molecules, such as molecules having selected properties.
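To give a feel for how gradients from several loss functions might be combined, the following sketch computes the minimum-norm convex combination of two gradient vectors, which is the closed-form two-objective case of the min-norm problem commonly associated with multiple-gradient descent. Whether the multi-objective GANs 242 use exactly this formulation is an assumption. The gradients shown are made-up numbers standing in for generator gradients obtained from two discriminators.

```python
def min_norm_weight(g1, g2):
    """Return alpha in [0, 1] minimizing ||alpha * g1 + (1 - alpha) * g2||^2."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        return 0.5  # identical gradients: any weighting gives the same direction
    alpha = sum((b - a) * b for a, b in zip(g1, g2)) / denom
    return min(1.0, max(0.0, alpha))


def common_descent_direction(g1, g2):
    """Combine two gradients into one update direction shared by both objectives."""
    alpha = min_norm_weight(g1, g2)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]


# Hypothetical generator gradients from two discriminator losses.
print(common_descent_direction([1.0, 0.0], [0.0, 1.0]))
```

With more than two objectives the same min-norm problem no longer has this one-line solution and is usually solved iteratively.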

In some implementations, one or more of the ML models 242-249 may be configured to operate as a language generation model, where the language is a description of molecules and/or molecular properties. For example, molecules may be described using strings according to SMILES notation, and the ML models 242-249 may be configured to perform SMILES to property (e.g., molecular properties) prediction, SMILES to latent space mapping, latent space to SMILES mapping, latent space to property prediction, or a combination thereof. In some implementations, some or all of these operations may be implemented by performing multi-objective, semi-supervised learning using ML models such as VAEs, GANs, or the like. To illustrate, the ML models may perform generative processes that include generating an input variable $x$ from a generative distribution $p_{\theta}(x|y,z)$, which is parameterized by $\theta$ and conditioned on an output variable $y$ and a latent variable $z$. The output variable $y$ may be treated as an additional latent variable when $x$ is not labeled, which may require introducing a distribution over $y$. The prior distributions over $y$ and $z$ may be assumed to be $p(y)=\mathcal{N}(y|\mu_{y},\Sigma_{y})$ and $p(z)=\mathcal{N}(z|0,I)$. Variational inference may be used to address the intractability of exact posterior inference in the model, such that the posterior distributions over $y$ and $z$ may be approximated by $q_{\phi}(y|x)=\mathcal{N}(y|\mu_{\phi}(x),\mathrm{diag}(\sigma_{\phi}^{2}(x)))$ and $q_{\phi}(z|x,y)=\mathcal{N}(z|\mu_{\phi}(x,y),\mathrm{diag}(\sigma_{\phi}^{2}(x,y)))$, both of which may be parameterized by $\phi$. For the semi-supervised learning scenario where some values of $y$ are missing, the missing values may be predicted by $q_{\phi}(y|x)$.
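Two quantities that follow directly from the diagonal-Gaussian posterior and the standard-normal prior over z described above can be sketched as follows: the closed-form Kullback-Leibler divergence between them, and a reparameterized sample from the posterior. The mean and variance vectors would come from an encoder network in practice. Here they are hypothetical numbers, and the function names are illustrative.

```python
import math
import random


def kl_to_standard_normal(mu, sigma2):
    """Closed-form KL divergence from N(mu, diag(sigma2)) to N(0, I)."""
    return 0.5 * sum(s + m * m - 1.0 - math.log(s) for m, s in zip(mu, sigma2))


def sample_latent(mu, sigma2, rng):
    """Reparameterized draw: z = mu + sigma * eps, with eps drawn from N(0, I)."""
    return [m + math.sqrt(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma2)]


# Hypothetical encoder outputs for one molecule.
mu, sigma2 = [0.3, -1.2], [0.5, 2.0]
rng = random.Random(0)
print(kl_to_standard_normal(mu, sigma2), sample_latent(mu, sigma2, rng))
```

The divergence is zero exactly when the posterior equals the prior (zero mean, unit variance) and grows as the posterior moves away from it.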

In some implementations, objective functions for training the ML models may be based on the following functions. The conditional loss function, $\mathcal{L}(x, y)$, follows from a variational lower bound on $\log p(x, y)$ and is given by Equation 1 below.

$\log p(x, y) \geq \mathbb{E}_{q_{\phi}(z \mid x, y)}\left[\log p_{\theta}(x \mid y, z) + \log p(y) + \log p(z) - \log q_{\phi}(z \mid x, y)\right] = \mathbb{E}_{q_{\phi}(z \mid x, y)}\left[\log p_{\theta}(x \mid y, z) + \log p(y) - \mathcal{D}_{KL}\left(q_{\phi}(z \mid x, y) \,\|\, p(z)\right)\right] = -\mathcal{L}(x, y)$

Equation 1 - Example Conditional Loss Function

The unconditioned loss function, $\mathcal{U}(x)$, follows from a variational lower bound on $\log p(x)$ and is given by Equation 2 below.

$\log p(x) \geq \mathbb{E}_{q_{\phi}(y, z \mid x)}\left[\log p_{\theta}(x \mid y, z) + \log p(y) + \log p(z) - \log q_{\phi}(y, z \mid x)\right] = \mathbb{E}_{q_{\phi}(y, z \mid x)}\left[\log p_{\theta}(x \mid y, z)\right] - \mathcal{D}_{KL}\left(q_{\phi}(y \mid x) \,\|\, p(y)\right) - \mathbb{E}_{q_{\phi}(y \mid x)}\left[\mathcal{D}_{KL}\left(q_{\phi}(z \mid x, y) \,\|\, p(z)\right)\right] = -\mathcal{U}(x)$

Equation 2 - Example Unconditioned Loss Function

The final cost function, $\tau$, is given by Equation 3 below.

$\tau = \sum_{(x, y) \sim \bar{p}_{l}} \mathcal{L}(x, y) + \sum_{x \sim \bar{p}_{u}} \mathcal{U}(x) + \beta \sum_{(x, y) \sim \bar{p}_{l}} \left\lVert y - \mathbb{E}_{q_{\phi}(y \mid x)}[y] \right\rVert^{2}$

Equation 3—Example Final Cost Function
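As a non-limiting illustration, the Kullback-Leibler terms in Equations 1 and 2 have a closed form when the approximate posterior is a diagonal Gaussian and the prior is a standard normal distribution. The following Python sketch shows how the terms of Equations 1 and 3 could be assembled under those assumptions. The function names are hypothetical, and the reconstruction term log p_θ(x|y, z) is passed in as a single-sample estimate because the decoder architecture is not specified here.

```python
import math


def kl_diag_gaussian_to_standard_normal(mu, sigma2):
    """Closed-form KL( N(mu, diag(sigma2)) || N(0, I) )."""
    return 0.5 * sum(s + m * m - 1.0 - math.log(s) for m, s in zip(mu, sigma2))


def conditional_loss(log_px_given_yz, log_py, mu_z, sigma2_z):
    """L(x, y) from Equation 1, using a single-sample estimate of the expectation."""
    kl_z = kl_diag_gaussian_to_standard_normal(mu_z, sigma2_z)
    return -(log_px_given_yz + log_py - kl_z)


def squared_error(y, y_mean):
    """Squared distance between a label y and the predicted mean E_q[y]."""
    return sum((a - b) ** 2 for a, b in zip(y, y_mean))


def final_cost(labeled_losses, unlabeled_losses, regression_errors, beta):
    """Equation 3: labeled losses + unlabeled losses + beta * regression errors."""
    return sum(labeled_losses) + sum(unlabeled_losses) + beta * sum(regression_errors)
```

In this sketch, the labeled and unlabeled losses are computed per example and then summed, and `beta` weights the property regression term relative to the two variational terms.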

The property prediction model, $\hat{y}$, is given by Equation 4 below.

$\hat{y} \sim \mathcal{N}\left(\mu_{\phi}(x), \mathrm{diag}(\sigma_{\phi}^{2}(x))\right)$

Equation 4—Example Property Prediction Model
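To illustrate Equation 4, the following Python sketch draws a property prediction from a diagonal Gaussian given a mean vector and a variance vector. It assumes that the encoder has already produced those two vectors. The function name and the optional `rng` parameter are hypothetical.

```python
import math
import random


def sample_property_prediction(mu, sigma2, rng=None):
    """Draw y_hat ~ N(mu, diag(sigma2)), one independent Gaussian per property."""
    rng = rng or random.Random()
    return [m + math.sqrt(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma2)]
```

When a single point estimate is preferred over a sample, the mean vector itself may be used as the prediction.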

The molecule generation (e.g., identification) functions are given by Equations 5 and 6 below.

$\hat{x} = \arg\max_{x} \log p_{\theta}(x \mid y, z)$

Equation 5—Example Molecule Generation Function

$p_{\theta}(x \mid y, z) = \prod_{j} p_{\theta}\left(x^{(j)} \mid x^{(1)}, \ldots, x^{(j-1)}, y, z\right)$

Equation 6—Example Molecule Generation Function
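As a non-limiting illustration of Equations 5 and 6, the following Python sketch scores a token sequence as a sum of per-token log-probabilities and decodes greedily by taking the most probable next token at each step. Greedy decoding only approximates the arg max of Equation 5. The `toy_next_token_probs` function is a hypothetical stand-in for a trained decoder and ignores `y` and `z`.

```python
import math


def sequence_log_prob(tokens, next_token_probs, y, z):
    """log p(x | y, z) as the sum over j of log p(x_j | x_<j, y, z) (Equation 6)."""
    total = 0.0
    for j, token in enumerate(tokens):
        probs = next_token_probs(tokens[:j], y, z)
        total += math.log(probs[token])
    return total


def greedy_decode(next_token_probs, y, z, end_token="<end>", max_len=50):
    """Pick the most probable next token until the end token or max_len is reached."""
    prefix = []
    for _ in range(max_len):
        probs = next_token_probs(prefix, y, z)
        token = max(probs, key=probs.get)
        if token == end_token:
            break
        prefix.append(token)
    return prefix


def toy_next_token_probs(prefix, y, z):
    """Hypothetical decoder: favors carbon for two steps, then favors stopping."""
    if len(prefix) < 2:
        return {"C": 0.6, "O": 0.3, "<end>": 0.1}
    return {"C": 0.1, "O": 0.2, "<end>": 0.7}


decoded = greedy_decode(toy_next_token_probs, y=None, z=None)
```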

In some implementations, to identify new molecules, one or more of the ML models 242-249 may be configured to perform a beam search or a beam stack search (e.g., an improved beam search). A beam search is a greedy approach for generating new molecules in which the algorithm is initialized using a random array of a particular size. In some implementations, the array includes float values drawn from a Gaussian (normal) distribution. The beam search may be a heuristic approach that retains only the most promising β nodes (instead of all nodes) at each iteration of the search for further branching, where β is referred to as the beam width. The beam search may be an optimization of a best-first search that reduces memory requirements. In some implementations, the beam search may be performed according to the following pseudocode.

OPEN = {initial state}
while OPEN is not empty do {
    Remove the best node from OPEN, call it n
    If n is the goal state, back trace path to n (through recorded parents) and return path
    Create n's successors
    Evaluate each successor, add it to OPEN, and record its parent
    If |OPEN| > β, take the best β nodes (according to heuristic) and remove the others from OPEN
}
done
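The pseudocode above can be sketched in Python as follows. This is a minimal illustration and not a definitive implementation: it assumes that lower heuristic values are better and that states are hashable, and the toy successor and heuristic functions (which build the string "CCO" one character at a time) are hypothetical.

```python
import heapq


def beam_search(initial, is_goal, successors, heuristic, beam_width):
    """Best-first search that keeps at most `beam_width` nodes in OPEN.

    Lower heuristic values are treated as better (an assumption of this sketch).
    Returns the path from `initial` to the goal, or None if OPEN empties first.
    """
    open_nodes = [initial]
    parents = {initial: None}  # recorded parents, used for the back trace
    while open_nodes:
        # Remove the best node from OPEN.
        node = min(open_nodes, key=heuristic)
        open_nodes.remove(node)
        if is_goal(node):
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        # Create successors, add them to OPEN, and record their parent.
        for child in successors(node):
            if child not in parents:
                parents[child] = node
                open_nodes.append(child)
        # Keep only the best `beam_width` nodes.
        if len(open_nodes) > beam_width:
            open_nodes = heapq.nsmallest(beam_width, open_nodes, key=heuristic)
    return None


TARGET = "CCO"  # hypothetical goal string for the toy example


def toy_successors(s):
    return [s + ch for ch in "CNO"] if len(s) < len(TARGET) else []


def toy_heuristic(s):
    mismatches = sum(a != b for a, b in zip(s, TARGET))
    return 10 * mismatches + (len(TARGET) - len(s))


path = beam_search("", lambda s: s == TARGET, toy_successors, toy_heuristic, beam_width=2)
```

Because nodes outside the beam are discarded, the search is not guaranteed to find a goal that an exhaustive best-first search would reach; a wider beam trades memory for completeness.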

The access layer 250 may be configured to support one or more APIs for enabling interaction between the AI/ML engine 240 (or other components of the system 200) and the other devices 270 and/or user devices. The access layer 250 may include one or more generated model APIs 252 and one or more other APIs 254. The generated model APIs 252 may enable interaction between the other devices 270 and the ML models maintained by the AI/ML engine 240, such as the multi-objective GANs 242, the objective-reinforced GANs 244, the conditional deep GANs 246, the VAEs 248, the multi-objective VAEs 249, or a combination thereof. The other APIs 254 may enable interaction between other components of the system 200 and external devices, such as user devices. The API management layer 260 may be configured to manage operation of the APIs supported by the access layer 250 (e.g., the generated model APIs 252 and the other APIs 254).

The other devices 270 may include devices that interact with the system 200 (e.g., the drug discovery platform), such as client devices, servers, and the like. For example, the other devices 270 may include a front end client 272, a front end server 274, and a back end server 276. The front end client 272 may be configured to enable client interaction with the ML models maintained by the AI/ML engine 240 to enable pharmaceutical molecule identification at the front end client 272. In some other implementations, the AI/ML engine 240 may train the ML models and provide configuration information associated with the ML models to the front end client 272 such that the front end client 272 stores and operates the ML models to perform pharmaceutical molecule identification. The front end server 274 and the back end server 276 may store data used to support the AI/ML engine 240 (or other components of the system 200), such as training data, processed pharmaceutical data, results data, input data, and the like.

The message orchestration and logging layer 280 may be configured to generate and transmit messages, such as to user devices, and to log the messages. For example, the message orchestration and logging layer 280 may be configured to transmit messages and/or to initiate display of GUIs that enable user interaction with the molecule identification process, such as providing user input indicating target diseases, target starting molecules, selected properties to be associated with identified molecules, and the like, or viewing information regarding the identified molecules, such as names, molecular structures, expected properties, results data, or comparisons of the identified molecules to previously-identified molecules associated with the data sources 202. In some implementations, the message orchestration and logging layer 280 may provide a single point of access for users of the system 200.

As described above, the system 200 supports training of ML models (e.g., the multi-objective GANs 242, the objective-reinforced GANs 244, the conditional deep GANs 246, the VAEs 248, the multi-objective VAEs 249, or a combination thereof) to automatically identify pharmaceutical molecules. Using artificial intelligence and machine learning to identify the pharmaceutical molecules based on the pharmaceutical data from the data sources 202 may result in identification of a wider variety of new (e.g., previously-unidentified) pharmaceutical molecules, and of pharmaceutical molecules that are more likely to be successful (e.g., pharmaceutical molecules that have a higher likelihood of treating target diseases or conditions), as compared to other drug discovery systems that rely substantially on user input and the knowledge of a human drug expert.

Referring to FIG. 3, a flow diagram of an example of a method for identifying pharmaceutical molecules and for identifying uses for pharmaceutical molecules according to one or more aspects is shown as a method 300. In some implementations, the operations of the method 300 may be stored as instructions that, when executed by one or more processors (e.g., the one or more processors of a computing device or a server), cause the one or more processors to perform the operations of the method 300. In some implementations, the method 300 may be performed by one or more components of a system configured to perform pharmaceutical molecule identification (e.g., drug discovery), such as one or more components of the system 100 of FIG. 1, one or more components of the system 200 of FIG. 2, one or more components of a system configured to identify uses for pharmaceutical molecules (e.g., to screen and rank pharmaceutical molecules), or a combination thereof.

The method 300 includes collecting and selecting molecule and drug data, at 302. For example, the system may obtain pharmaceutical data from one or more databases or data sources, as described with reference to FIGS. 1 and 2. Additionally, the system may obtain binding data from one or more binding databases, the Delaney dataset (e.g., a standard regression dataset containing structures and water solubility data for multiple compounds), other types of drug-protein relation data, or a combination thereof. The binding data may indicate the likelihood of previously-identified molecules binding to one or more proteins, which may indicate which diseases or other conditions are treatable by the previously-identified molecules.

The method 300 includes training one or more generative ML models, at 304. For example, the system may train one or more generative ML models, such as VAEs or GANs, to identify “new” (e.g., previously-unidentified) pharmaceutical molecules, as described with reference to the ML models 126 of FIG. 1 and the ML models 242-249 of FIG. 2. The method 300 includes identifying one or more previously-unidentified molecules, at 306. For example, the system may access the trained generative ML models to identify previously-unidentified pharmaceutical molecules based on the obtained pharmaceutical data. In some implementations, identifying the new pharmaceutical molecules may include conditional identification of molecules, at 308. For example, the generative ML models may be trained to identify particular types of molecules, such as molecules having (or expected to have) selected properties or molecules that are to be used to cure or treat particular diseases or conditions, as non-limiting examples. Additionally or alternatively, identifying the new pharmaceutical molecules may include unconditional identification of molecules, at 310. For example, the generative ML models may be trained to identify new pharmaceutical molecules without any constraints, based only on the underlying similarities between the previously-identified molecules that are derived from the training data.

The method 300 includes predicting a cluster to which one or more pharmaceutical molecules are assigned, at 312. Each of the clusters may correspond to one or more proteins to which molecules assigned to the cluster are likely (or have been observed) to bind. Binding of molecules to particular proteins may indicate which diseases or conditions the molecules may be used to treat. To illustrate, the system may train one or more ML models to perform unsupervised learning to cluster previously-identified molecules into clusters corresponding to binding proteins based on training data generated based on the obtained pharmaceutical data, particularly the data obtained from the binding databases. Input data indicating one or more molecules (e.g., feature vectors generated from strings that combine various structural or other properties of the molecules) may be provided to the trained ML models to predict the cluster assignment using sparse subspace clustering (SSC). Clustering molecules in this manner may be referred to as limiting the search space for proteins/diseases associated with the molecules, which may be desirable due to the large chemical search space, which may be on the order of $10^{60}$. In some implementations, the clustering may include density-based spatial clustering of applications with noise (DBSCAN), K-means clustering, K-means for large-scale clustering (LSC-K), longest common subsequence (LCS) clustering, longest common cyclic subsequence (LCCS) clustering, or the like, in order to cluster large volumes of high dimensional data. In some implementations, newly-identified molecules from the generative ML models may be used as input data to the ML models that perform the clustering to predict the clusters assigned to the newly-identified molecules. Additionally or alternatively, previously-identified molecules may be used as input data to the ML models that perform the clustering to predict other possible proteins that could be bound to the previously-identified molecules, thereby predicting other diseases that the previously-identified molecules could be used to treat. Thus, the cluster prediction may identify potential diseases to be treated by newly-identified molecules as well as additional diseases that may be treated by already-released drugs.
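As a non-limiting illustration of cluster prediction, the following Python sketch uses K-means, one of the clustering options listed above, rather than sparse subspace clustering. It assumes that molecules have already been converted to numeric feature vectors, and it seeds the centroids with the first k points for reproducibility. The function names and toy feature vectors are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda c: squared_distance(point, centroids[c]))


def kmeans(points, k, iterations=20):
    """Plain K-means; centroids are initialized to the first k points."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [nearest_centroid(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Toy feature vectors standing in for previously-identified molecules.
known = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
assignments, centroids = kmeans(known, k=2)

# Predict the cluster of a newly-identified molecule.
new_cluster = nearest_centroid([9.0, 9.0], centroids)
```

In this sketch, each cluster index would be associated with one or more binding proteins, so predicting the cluster of a new molecule narrows the set of candidate proteins to evaluate.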

The method 300 includes generating cluster data, at 314. The cluster data may indicate the members of each cluster, closest molecules to the cluster for target identification, proteins associated with the clusters, or a combination thereof. Additionally or alternatively, the system may determine scores for each molecule in a cluster to which a particular molecule is assigned, the scores may be used to filter the cluster into a subset of higher-scored molecules, and the cluster data may indicate the scores, the subset, or a combination thereof. To illustrate, each molecule assigned to the cluster may be scored using one or more scoring metrics, and the scores for a respective molecule may be averaged to generate an average score (or other aggregated score) for each molecule. The average scores may be compared to one or more thresholds to identify a subset of higher-scored candidate molecules, to identify one or more particular proteins to which the subset is most likely to bind, or a combination thereof. In some implementations, the scoring may be performed based on Tanimoto indices or coefficients, cosine similarity values, laboratory control sample (LCS) data, Library for the Enumeration of Modular Natural Structures (LEMONS) data, or the like.
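As a non-limiting illustration of the scoring described above, the following Python sketch computes a Tanimoto coefficient and a cosine similarity, averages the per-molecule scores, and filters by a threshold. It assumes that fingerprints are represented as sets of "on" bit indices and that the threshold value is chosen by the operator. The function and molecule names are hypothetical.

```python
import math


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    union = bits_a | bits_b
    return len(bits_a & bits_b) / len(union) if union else 0.0


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def average_scores(scores_by_molecule):
    """Average the individual metric scores recorded for each molecule."""
    return {name: sum(s) / len(s) for name, s in scores_by_molecule.items()}


def filter_by_threshold(averages, threshold):
    """Keep only molecules whose average score meets the threshold."""
    return {name: s for name, s in averages.items() if s >= threshold}


averages = average_scores({"mol_a": [0.9, 0.7], "mol_b": [0.2, 0.4]})
subset = filter_by_threshold(averages, threshold=0.5)
```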

The method 300 includes storing the cluster data in a database, at 316. The cluster data may include data representing members of the clusters, proteins associated with the clusters, scores associated with members of the clusters, other cluster data, or a combination thereof.

The method 300 includes performing conjoint analysis on the subset of molecules, at 318. The conjoint analysis may indicate which properties or characteristics of pharmaceutical molecules are most sought after by one or more clients, such as pharmaceutical companies, universities, private research firms, and the like. To illustrate, the conjoint analysis may include providing users with multiple questions that prompt the user to choose between potential molecules having combinations of different properties (as opposed to simply prompting the user to choose desired properties), and analyzing user input to the questions to calculate preference scores for the properties. Although described as being based on user input, in some other implementations, the conjoint analysis may be performed based on extracted or otherwise data-mined information, such as from company press releases indicating new drugs or areas of research, market valuations of particular drugs or potential drugs for curing particular diseases, other information, or the like. In some implementations, one or more ML models may be trained to predict preference scores for input candidate molecules based on training data derived from user responses or other historical information associated with previously-identified molecules or released drugs.
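As a non-limiting illustration, preference scores can be estimated from choice questions by counting how often each property level is chosen relative to how often it is shown. The following Python sketch uses that simple counting approach as a stand-in; a full conjoint study would typically fit a statistical model instead. The data layout (a list of profiles plus the index of the chosen profile) and all names are assumptions made for illustration.

```python
from collections import defaultdict


def preference_scores(questions):
    """Share of appearances in which each (property, level) pair was chosen.

    `questions` is a list of (profiles, picked_index) pairs, where `profiles`
    is a list of dicts mapping a property name to a level.
    """
    shown = defaultdict(int)
    chosen = defaultdict(int)
    for profiles, picked_index in questions:
        for i, profile in enumerate(profiles):
            for prop, level in profile.items():
                shown[(prop, level)] += 1
                if i == picked_index:
                    chosen[(prop, level)] += 1
    return {key: chosen[key] / shown[key] for key in shown}


# Two hypothetical questions, each offering two candidate profiles.
questions = [
    ([{"solubility": "high", "toxicity": "low"},
      {"solubility": "low", "toxicity": "low"}], 0),
    ([{"solubility": "high", "toxicity": "high"},
      {"solubility": "low", "toxicity": "low"}], 1),
]
scores = preference_scores(questions)
```

Here the respondent never chooses a high-toxicity profile, so that level receives the lowest preference score.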

The method 300 includes ranking the subset of molecules based on the conjoint analysis, at 320. For example, based on scores determined during the conjoint analysis, the subset of molecules may be ranked and, optionally, further filtered based on one or more thresholds. The method 300 concludes by outputting recommendations for one or more molecules for use in drug testing and production, at 322. Due to the clustering and ranking, the recommended molecules may be more likely to result in useful or marketable drugs, and therefore more likely to result in shorter testing/development cycles and increased revenue to the clients.
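As a non-limiting illustration of the ranking and optional filtering at 320 and 322, the following Python sketch sorts candidate molecules by score in descending order and optionally applies a minimum score and a maximum number of recommendations. The function name, parameter names, and example scores are hypothetical.

```python
def rank_candidates(candidate_scores, min_score=None, top_n=None):
    """Sort candidates by descending score, then apply the optional filters."""
    ranked = sorted(candidate_scores.items(), key=lambda item: item[1], reverse=True)
    if min_score is not None:
        ranked = [(name, score) for name, score in ranked if score >= min_score]
    return ranked[:top_n] if top_n is not None else ranked


recommendations = rank_candidates(
    {"mol_a": 0.4, "mol_b": 0.9, "mol_c": 0.7}, min_score=0.5
)
```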

Referring to FIG. 4, a flow diagram of an example of a method for pharmaceutical molecule identification using machine learning according to one or more aspects is shown as a method 400. In some implementations, the operations of the method 400 may be stored as instructions that, when executed by one or more processors (e.g., the one or more processors of a computing device or a server), cause the one or more processors to perform the operations of the method 400. In some implementations, the method 400 may be performed by a computing device, such as the computing device 102 of FIG. 1 (e.g., a computing device configured for pharmaceutical molecule identification or drug discovery), one or more components of the system 200 of FIG. 2, or a combination thereof.

The method 400 includes obtaining pharmaceutical data indicating properties of previously-identified pharmaceutical molecules from one or more databases, at 402. The pharmaceutical data includes molecular physiochemical data, drug impact data, side effect data, toxicity data, solubility data, or a combination thereof. For example, the pharmaceutical data may include or correspond to the pharmaceutical data 132 of FIG. 1, which may include the physiochemical data 152, the drug impact data 154, the side effect data 156, the toxicity data 158, the solubility data 160, or a combination thereof.

The method 400 also includes performing NLP on at least a portion of the pharmaceutical data to convert the at least a portion of the pharmaceutical data to training data, at 404. The training data includes vectorized representations of the properties of the previously-identified pharmaceutical molecules. For example, the data processing and transformation engine 122 of FIG. 1 may perform NLP on a portion of the pharmaceutical data 132 to generate the training data 110. The method 400 further includes training, by the one or more processors, one or more ML models based on the training data to configure the one or more ML models to identify additional pharmaceutical molecules, at 406. The additional pharmaceutical molecules are distinct from the previously-identified pharmaceutical molecules. For example, the one or more ML models and the additional pharmaceutical molecules may include or correspond to the ML models 126 and the identified molecules 112, respectively, of FIG. 1.
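As a non-limiting illustration of converting text-formatted molecule data into vectorized training data, the following Python sketch tokenizes SMILES strings with a regular expression and encodes them as fixed-length integer sequences. The token pattern is a simplified assumption (bracket atoms, the two-letter halogens Cl and Br, two-digit ring closures, and single characters) and does not cover every SMILES feature. All names are hypothetical.

```python
import re

# Simplified token pattern; bracket atoms and two-letter halogens are matched
# before falling back to single characters.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into tokens."""
    return SMILES_TOKEN.findall(smiles)


def build_vocab(smiles_list):
    """Map each token to an integer id; id 0 is reserved for padding."""
    tokens = sorted({t for s in smiles_list for t in tokenize(s)})
    vocab = {"<pad>": 0}
    vocab.update({t: i + 1 for i, t in enumerate(tokens)})
    return vocab


def encode(smiles, vocab, max_len):
    """Encode a SMILES string as a fixed-length list of token ids."""
    ids = [vocab[t] for t in tokenize(smiles)][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


vocab = build_vocab(["CCO", "CCl"])
vector = encode("CCO", vocab, max_len=5)
```

In this sketch, a token that is absent from the vocabulary raises a `KeyError`; a production pipeline would typically reserve an additional id for unknown tokens.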

In some implementations, the method 400 may also include generating an output that indicates one or more molecules identified by the one or more ML models. For example, the output may include or correspond to the output 134 of FIG. 1. In some such implementations, the method 400 may also include initiating, based on the output, display of a GUI that indicates the one or more molecules. For example, the output 134 may be provided to the display device 130 of FIG. 1 to cause display of a GUI that indicates the identified molecules 112. Additionally or alternatively, the method 400 may also include providing the one or more ML models to a client device for pharmaceutical molecule identification by the client device. For example, configuration information (e.g., parameters, hyper-parameters, and the like) associated with the ML models 126 may be provided to the client device 162 of FIG. 1 to enable pharmaceutical molecule identification at the client device 162. Additionally or alternatively, generating the output may include transmitting an instruction to an automated or semi-automated system to cause the automated or semi-automated system to initiate development of samples of the one or more molecules. For example, the output 134 of FIG. 1 may include one or more instructions that are provided to the drug production system 164 to cause performance of one or more operations by the drug production system 164 to develop samples of the identified molecules 112.

In some implementations, the method 400 may further include generating additional training data based on one or more molecules identified by the one or more ML models, testing data associated with the one or more molecules, or a combination thereof, and training the one or more ML models based on the additional training data. For example, the additional training data may include or correspond to the additional training data 116 of FIG. 1. Additionally or alternatively, the pharmaceutical data may include SMILES-formatted data, and the NLP may be performed on the SMILES-formatted data to generate the training data. For example, the data processing and transformation engine 122 or the training engine 124 may perform NLP on at least a portion of the pharmaceutical data 132 of FIG. 1 to generate the training data 110. Additionally or alternatively, a first subset of the pharmaceutical data may be associated with previously-identified pharmaceutical molecules having one or more particular properties, a second subset of the pharmaceutical data may be associated with previously-identified pharmaceutical molecules that do not have the one or more particular properties, and the one or more ML models may be trained to identify the additional pharmaceutical molecules having the one or more particular properties. For example, a first portion of the pharmaceutical data 132 may be associated with previously-identified molecules that have the selected properties 114, a second portion of the pharmaceutical data 132 may be associated with previously-identified molecules that do not have the selected properties 114, and the ML models 126 may be trained to conditionally identify the identified molecules 112 such that the identified molecules 112 have (or are predicted to have) the selected properties 114.

In some implementations, the one or more databases may include the ZINC database, the ChEMBL database, the PubChem database, or a combination thereof. For example, the databases 150 may include one or more publicly available molecular information databases, such as the ZINC database 206, the ChEMBL database 209, the PubChem database, or a combination thereof. Additionally or alternatively, the method 400 may also include performing pre-processing on the pharmaceutical data prior to performing the NLP, performing dimensionality reduction on the pharmaceutical data prior to performing the NLP, or a combination thereof. For example, the data processing and transformation engine 122 of FIG. 1 may perform pre-processing, such as formatting, outlier removal, missing entry replacement, dimensionality reduction, other pre-processing, or a combination thereof.
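As a non-limiting illustration of two of the pre-processing operations mentioned above, the following Python sketch replaces missing entries in a numeric column with the median of the observed values and removes outliers using a z-score cutoff. The choice of median imputation, the z-score rule, and the default cutoff are assumptions made for illustration; dimensionality reduction is not shown.

```python
import statistics


def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def remove_outliers(values, max_z=3.0):
    """Drop values whose z-score magnitude exceeds `max_z`."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return list(values)
    return [v for v in values if abs(v - mean) / std <= max_z]


# Hypothetical solubility column with one missing entry.
solubility = fill_missing([1.0, None, 3.0])
```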

In some implementations, the one or more ML models may include one or more GANs, one or more VAEs, or a combination thereof. For example, the ML models 126 may include or correspond to GANs, VAEs, or both, such as the multi-objective GANs 242, the objective-reinforced GANs 244, the conditional deep GANs 246, the VAEs 248, and the multi-objective VAEs 249 described with reference to FIG. 2. Additionally or alternatively, obtaining the pharmaceutical data may include receiving a portion of the pharmaceutical data from the one or more databases, pulling a portion of the pharmaceutical data from the one or more databases, extracting a portion of the pharmaceutical data from information presented by the one or more databases, or a combination thereof. For example, the computing device 102 may obtain the pharmaceutical data 132 by querying and receiving at least a portion of the pharmaceutical data 132 (similar to the Python scripts 212 of FIG. 2), extracting at least a portion of the pharmaceutical data 132 from websites or other documents supported by the databases 150 (similar to the crawler 214 of FIG. 2), performing one or more pull operations to retrieve at least a portion of the pharmaceutical data 132 (similar to the manual pull logic 216 of FIG. 2), or a combination thereof.

In some implementations, the one or more databases may include one or more publicly available databases, one or more proprietary databases, one or more third-party databases, or a combination thereof. For example, the databases 150 of FIG. 1 may include publicly available databases (e.g., the ZINC database, the ChEMBL database, the PubChem database, and the like), proprietary databases (e.g., pharmaceutical information databases maintained and operated by an operator of the computing device 102 or the client device 162), third-party databases (e.g., databases maintained and operated by other drug companies, universities, government agencies, and the like), or a combination thereof. Additionally or alternatively, the one or more ML models may include a GAN, a VAE, and a multi-objective VAE. For example, the AI/ML engine 240 of FIG. 2 may train and support one or more GANs (e.g., the multi-objective GANs 242, the objective-reinforced GANs 244, the conditional deep GANs 246, or a combination thereof), the VAEs 248, and the multi-objective VAEs 249.

In some implementations, the method 400 may further include initiating display of a GUI that indicates one or more molecules identified by the one or more ML models, providing the one or more ML models to a client device for pharmaceutical molecule identification by the client device, or a combination thereof. For example, the output 134 of FIG. 1 may be provided to the display device 130 to cause display of a GUI that indicates the identified molecules 112 at the display device 130, or configuration data associated with the trained ML models 126 may be provided to the client device 162 to enable configuration and use of ML models at the client device 162 for performing molecule identification (e.g., drug discovery). Additionally or alternatively, the method 400 may further include receiving a user input indicating one or more particular properties and training the one or more ML models to identify the additional pharmaceutical molecules having the one or more particular properties. For example, the computing device 102 of FIG. 1 may receive a user input (e.g., from a user device) indicating the selected properties 114 for use in training the ML models 126 such that the identified molecules 112 have (or are predicted to have) the selected properties 114.

It is noted that other types of devices and functionality may be provided according to aspects of the present disclosure and that discussion of specific devices and functionality herein has been provided for purposes of illustration, rather than by way of limitation. It is noted that the operations of the method 300 of FIG. 3 and the method 400 of FIG. 4 may be performed in any order, or that operations of one method may be performed during performance of another method, such as the method 400 of FIG. 4 including one or more operations of the method 300 of FIG. 3. It is also noted that the method 300 of FIG. 3 and the method 400 of FIG. 4 may also include other functionality or operations consistent with the description of the operations of the system 100 of FIG. 1 and/or the system 200 of FIG. 2.

Those of skill in the art would understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.

The functional blocks and modules described herein (e.g., the functional blocks and modules in FIGS. 1-4) may comprise processors, electronics devices, hardware devices, electronics components, logical circuits, memories, software codes, firmware codes, etc., or any combination thereof. In addition, features discussed herein relating to FIGS. 1-4 may be implemented via specialized processor circuitry, via executable instructions, and/or combinations thereof.

As used herein, various terminology is for the purpose of describing particular implementations only and is not intended to be limiting of implementations. For example, as used herein, an ordinal term (e.g., “first,” “second,” “third,” etc.) used to modify an element, such as a structure, a component, an operation, etc., does not by itself indicate any priority or order of the element with respect to another element, but rather merely distinguishes the element from another element having a same name (but for use of the ordinal term). The term “coupled” is defined as connected, although not necessarily directly, and not necessarily mechanically; two items that are “coupled” may be unitary with each other. The terms “a” and “an” are defined as one or more unless this disclosure explicitly requires otherwise. The term “substantially” is defined as largely but not necessarily wholly what is specified (and includes what is specified; e.g., substantially 90 degrees includes 90 degrees and substantially parallel includes parallel), as understood by a person of ordinary skill in the art. In any disclosed aspect, the term “substantially” may be substituted with “within [a percentage] of” what is specified, where the percentage includes 0.1, 1, 5, and 10 percent; and the term “approximately” may be substituted with “within 10 percent of” what is specified. The phrase “and/or” means and or or. To illustrate, A, B, and/or C includes: A alone, B alone, C alone, a combination of A and B, a combination of A and C, a combination of B and C, or a combination of A, B, and C. In other words, “and/or” operates as an inclusive or. Additionally, the phrase “A, B, C, or a combination thereof” or “A, B, C, or any combination thereof” includes: A alone, B alone, C alone, a combination of A and B, a combination of A and C, a combination of B and C, or a combination of A, B, and C.

The terms “comprise” and any form thereof such as “comprises” and “comprising,” “have” and any form thereof such as “has” and “having,” and “include” and any form thereof such as “includes” and “including” are open-ended linking verbs. As a result, an apparatus that “comprises,” “has,” or “includes” one or more elements possesses those one or more elements, but is not limited to possessing only those elements. Likewise, a method that “comprises,” “has,” or “includes” one or more steps possesses those one or more steps, but is not limited to possessing only those one or more steps.

Any implementation of any of the apparatuses, systems, and methods can consist of or consist essentially of, rather than comprise/include/have, any of the described steps, elements, and/or features. Thus, in any of the claims, the term “consisting of” or “consisting essentially of” can be substituted for any of the open-ended linking verbs recited above, in order to change the scope of a given claim from what it would otherwise be using the open-ended linking verb. Additionally, it will be understood that the term “wherein” may be used interchangeably with “where.”

Further, a device or system that is configured in a certain way is configured in at least that way, but it can also be configured in other ways than those specifically described. Aspects of one example may be applied to other examples, even though not described or illustrated, unless expressly prohibited by this disclosure or the nature of a particular example.

Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps (e.g., the logical blocks in FIGS. 1-4) described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure. Skilled artisans will also readily recognize that the order or combination of components, methods, or interactions that are described herein are merely examples and that the components, methods, or interactions of the various aspects of the present disclosure may be combined or performed in ways other than those illustrated and described herein.

The various illustrative logical blocks, modules, and circuits described in connection with the disclosure herein may be implemented or performed with a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.

The steps of a method or algorithm described in connection with the disclosure herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.

In one or more exemplary designs, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. Computer-readable storage media may be any available media that can be accessed by a general-purpose or special-purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code means in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Also, a connection may be properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, or digital subscriber line (DSL), then the coaxial cable, fiber optic cable, twisted pair, or DSL are included in the definition of medium. Disk and disc, as used herein, includes compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), hard disk, solid state disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

The above specification and examples provide a complete description of the structure and use of illustrative implementations. Although certain examples have been described above with a certain degree of particularity, or with reference to one or more individual examples, those skilled in the art could make numerous alterations to the disclosed implementations without departing from the scope of this disclosure. As such, the various illustrative implementations of the methods and systems are not intended to be limited to the particular forms disclosed. Rather, they include all modifications and alternatives falling within the scope of the claims, and examples other than the one shown may include some or all of the features of the depicted example. For example, elements may be omitted or combined as a unitary structure, and/or connections may be substituted. Further, where appropriate, aspects of any of the examples described above may be combined with aspects of any of the other examples described to form further examples having comparable or different properties and/or functions, and addressing the same or different problems. Similarly, it will be understood that the benefits and advantages described above may relate to one aspect or may relate to several implementations.

The claims are not intended to include, and should not be interpreted to include, means-plus- or step-plus-function limitations, unless such a limitation is explicitly recited in a given claim using the phrase(s) “means for” or “step for,” respectively.

Although the aspects of the present disclosure and their advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit of the disclosure as defined by the appended claims. Moreover, the scope of the present application is not intended to be limited to the particular implementations of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the present disclosure, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed that perform substantially the same function or achieve substantially the same result as the corresponding aspects described herein may be utilized according to the present disclosure. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.

1. A method for pharmaceutical molecule identification using machine learning, the method comprising: obtaining, by one or more processors, pharmaceutical data indicating properties of previously-discovered pharmaceutical molecules from one or more databases, wherein the pharmaceutical data includes molecular physiochemical data and one or more of: drug impact data, side effect data, toxicity data, solubility data, or a combination thereof; performing, by the one or more processors, natural language processing (NLP) on at least a portion of the pharmaceutical data to convert the at least a portion of the pharmaceutical data to training data, wherein the training data comprises vectorized representations of the properties of the previously-discovered pharmaceutical molecules; training, by the one or more processors, one or more machine learning (ML) models based on the training data to configure the one or more ML models to identify additional pharmaceutical molecules, wherein the additional pharmaceutical molecules are distinct from the previously-discovered pharmaceutical molecules; and generating, by the one or more processors, an output that indicates one or more molecules identified by the one or more ML models, wherein at least a portion of the output comprises simplified molecular-input line-entry system (SMILES) representations of the one or more molecules identified by the one or more ML models.
 2. The method of claim 1, further comprising:generating, by the one or more processors and after training the one ormore ML models based on the training data, additional training databased on testing data that indicates properties of at least one of theadditional pharmaceutical molecules identified by the one or more MLmodels; and further training, by the one or more processors, the one ormore ML models based on the additional training data.
 3. The method ofclaim 1, further comprising initiating, by the one or more processorsand based on the output, display of a graphical user interface (GUI)that indicates the one or more molecules.
 4. The method of claim 1,further comprising providing, by the one or more processors, the one ormore ML models to a client device for pharmaceutical modelidentification by the client device.
 5. The method of claim 1, wherein:generating the output comprises transmitting an instruction to anautomated or semi-automated system; and the instruction is executable bythe automated or semi-automated system to cause formation of samples ofthe one or more molecules.
 6. (canceled)
 7. The method of claim 1,wherein: the pharmaceutical data comprises simplified molecular-inputline-entry system (SMILES)-formatted data, and the NLP is performed onthe SMILES-formatted data to generate the training data.
 8. The methodof claim 1, wherein: a first subset of the pharmaceutical data isassociated with previously-discovered pharmaceutical molecules havingone or more particular properties, a second subset of the pharmaceuticaldata is associated with previously-discovered pharmaceutical moleculesthat do not have the one or more particular properties, and the one ormore ML models are trained to identify the additional pharmaceuticalmolecules having the one or more particular properties.
 9. The method ofclaim 1, wherein the one or more molecules identified by the one or moreML models comprise different combinations of elements than thepreviously-discovered pharmaceutical molecules, different molecularstructures than the previously-discovered pharmaceutical molecules, ordifferent combinations of elements and different molecular structuresthan the previously-discovered pharmaceutical molecules.
 10. The methodof claim 1, further comprising: performing, by the one or moreprocessors, pre-processing on the pharmaceutical data prior toperforming the NLP; performing, by the one or more processors,dimensionality reduction on the pharmaceutical data prior to performingthe NLP; or a combination thereof.
11. The method of claim 1, wherein the one or more ML models comprise one or more generative adversarial networks (GANs), one or more variational autoencoders (VAEs), or a combination thereof.
12. The method of claim 1, wherein: obtaining the pharmaceutical data comprises receiving a portion of the pharmaceutical data from the one or more databases, pulling a portion of the pharmaceutical data from the one or more databases, extracting a portion of the pharmaceutical data from information presented by the one or more databases, or a combination thereof; and the one or more databases comprise the ZINC database, the ChEMBL database, the PubChem database, or a combination thereof.
13. A system for pharmaceutical molecule identification using machine learning, the system comprising: a memory; and one or more processors communicatively coupled to the memory, the one or more processors configured to: obtain pharmaceutical data indicating properties of previously-discovered pharmaceutical molecules from one or more databases, wherein the pharmaceutical data includes molecular physiochemical data and one or more of: drug impact data, side effect data, toxicity data, solubility data, or a combination thereof; perform natural language processing (NLP) on at least a portion of the pharmaceutical data to convert the at least a portion of the pharmaceutical data to training data, wherein the training data comprises vectorized representations of the properties of the previously-discovered pharmaceutical molecules; train one or more machine learning (ML) models based on the training data to configure the one or more ML models to identify additional pharmaceutical molecules, wherein the additional pharmaceutical molecules are distinct from the previously-discovered pharmaceutical molecules; and generate an output that indicates one or more molecules identified by the one or more ML models, wherein at least a portion of the output comprises simplified molecular-input line-entry system (SMILES) representations of the one or more molecules identified by the one or more ML models.
14. The system of claim 13, wherein the one or more ML models comprise one or more generative models configured to generate the additional pharmaceutical molecules, the additional pharmaceutical molecules comprising new examples of molecules that have common relationships as the previously-discovered pharmaceutical molecules.
15. The system of claim 13, further comprising one or more interfaces configured to enable communication with the one or more databases, a display device, a client device, a drug production system, or a combination thereof.
16. The system of claim 13, wherein the one or more databases comprise one or more publicly-available databases, one or more proprietary databases, one or more third-party databases, or a combination thereof.
17. The system of claim 13, wherein: the one or more ML models comprise a generative adversarial network (GAN), a variational autoencoder (VAE), and a multi-objective VAE; the GAN is configured to be trained using reinforcement learning to bias achievement of particular objectives; and the multi-objective VAE is configured to be trained using multiple discriminators that are each associated with a respective loss function.
18. A non-transitory computer-readable storage medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform operations for pharmaceutical molecule identification using machine learning, the operations comprising: obtaining pharmaceutical data indicating properties of previously-discovered pharmaceutical molecules from one or more databases, wherein the pharmaceutical data includes molecular physiochemical data and one or more of: drug impact data, side effect data, toxicity data, solubility data, or a combination thereof; performing natural language processing (NLP) on at least a portion of the pharmaceutical data to convert the at least a portion of the pharmaceutical data to training data, wherein the training data comprises vectorized representations of the properties of the previously-discovered pharmaceutical molecules; training one or more machine learning (ML) models based on the training data to configure the one or more ML models to identify additional pharmaceutical molecules, wherein the additional pharmaceutical molecules are distinct from the previously-discovered pharmaceutical molecules; and generating an output that indicates one or more molecules identified by the one or more ML models, wherein at least a portion of the output comprises simplified molecular-input line-entry system (SMILES) representations of the one or more molecules identified by the one or more ML models.
19. The non-transitory computer-readable storage medium of claim 18, wherein the operations further comprise: initiating display of a graphical user interface (GUI) that indicates one or more molecules identified by the one or more ML models; providing the one or more ML models to a client device for pharmaceutical model identification by the client device; or a combination thereof.
20. The non-transitory computer-readable storage medium of claim 18, wherein the operations further comprise: receiving a user input indicating one or more particular properties; and training the one or more ML models to identify the additional pharmaceutical molecules having the one or more particular properties.
21. The method of claim 5, wherein the instruction is executable by the automated or semi-automated system to initiate mixing of one or more chemicals, to activate a heater or cooler to change a state of a chemical, to cause retrieval of one or more chemicals from a storage location, or a combination thereof.
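
By way of non-limiting illustration only, the conversion of SMILES-formatted data into vectorized training data recited in claims 1 and 7 may be sketched as follows. The sketch assumes a simple character-level tokenization, which is only one possible approach and is not asserted to be the NLP technique of the disclosure; the example SMILES strings, the padding symbol, and the function names `build_vocab`, `encode`, and `decode` are hypothetical and chosen for illustration.

```python
from typing import Dict, List

# Hypothetical example inputs: SMILES strings for ethanol, benzene, and acetic acid.
SMILES = ["CCO", "c1ccccc1", "CC(=O)O"]

# Hypothetical padding symbol, chosen because it is not a SMILES character.
PAD = "_"


def build_vocab(smiles: List[str]) -> Dict[str, int]:
    """Map each character seen in the corpus to an integer index (0 is padding)."""
    chars = sorted({ch for s in smiles for ch in s})
    return {PAD: 0, **{ch: i + 1 for i, ch in enumerate(chars)}}


def encode(s: str, vocab: Dict[str, int], length: int) -> List[int]:
    """Convert one SMILES string to a fixed-length vector of token indices."""
    if len(s) > length:
        raise ValueError("SMILES string is longer than the target length")
    ids = [vocab[ch] for ch in s]
    return ids + [vocab[PAD]] * (length - len(s))


def decode(ids: List[int], vocab: Dict[str, int]) -> str:
    """Invert encode(): drop padding and map indices back to characters."""
    inverse = {i: ch for ch, i in vocab.items()}
    return "".join(inverse[i] for i in ids if i != vocab[PAD])


vocab = build_vocab(SMILES)
max_len = max(len(s) for s in SMILES)
vectors = [encode(s, vocab, max_len) for s in SMILES]
```

In this sketch each molecule becomes an equal-length integer vector that could be supplied to a model as training data, and `decode` shows how a vector could be mapped back to a SMILES string for output. A character-level scheme splits two-letter atom symbols such as Cl or Br into separate tokens, so a practical implementation would likely use a tokenizer that is aware of SMILES syntax.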